Python Machine Learning 2
Chapter 1 Introduction
1. Purpose
2. About the Execution Environment for Source Code
Chapter 2 For beginners
1. Simple Linear Regression with Synthetic Data to Predict Product Pricing
2. Logistic Regression Visualization for Predicting Customer Purchase
3. K-Means Clustering: Customer Segmentation Problem
4. Decision Tree Classification for Customer Churn Prediction
5. Random Forest Feature Importance Analysis for Customer Churn Prediction
6. Support Vector Machine Classification and Visualization Problem
7. Principal Component Analysis with Visualization for Customer Data Analysis
8. Confusion Matrix Visualization in Model Evaluation
9. ROC Curve Analysis for Binary Classifier Performance
10. Time Series Data Generation and Basic Line Plot for Machine Learning
11. Heatmap of Correlation Matrix for Customer Sales Data Analysis
12. Histogram of Feature Distributions for Machine Learning Model Inputs
13. Box Plot Outlier Detection in Customer Sales Data
14. Data Normalization and Visual Comparison for Machine Learning
15. Creating and Visualizing Interaction Terms in Machine Learning
16. Visualizing Customer Preferences using t-SNE
17. Polynomial Feature Generation and Visualization in Machine Learning
18. Custom Loss Function Visualization for a Manufacturing Optimization Problem
19. Cross-Validation Results Visualization in Machine Learning
20. Visualizing Decision Boundaries of Machine Learning Classifiers
21. Gradient Descent Optimization Path Visualization in Machine Learning
22. Clustering Evaluation Using Silhouette Score and Visualization
23. Comparison of Min-Max Scaling and Standardization in Machine Learning
24. Visualizing Model Learning Curves in Python
25. Class Distribution Visualization Before and After Applying SMOTE
26. Visualizing Feature Importance in Machine Learning Models
27. Understanding Overfitting and Underfitting with Learning Curves in Machine Learning
Chapter 3 For advanced
1. DBSCAN Clustering and Visualization for Customer Segmentation
2. Visualizing the Effectiveness of Hyperparameter Tuning in Machine Learning Models
3. Bias-Variance Tradeoff in Predictive Modeling
4. Visualizing the Impact of Data Augmentation on Image Classification
5. Visualizing Multicollinearity Using VIF in a Marketing Dataset
6. Feature Selection Techniques and Their Impact on Model Performance
7. Time Series Decomposition for Sales Forecasting
8. Building and Visualizing a Movie Recommendation System using Collaborative Filtering
9. Exploratory Data Analysis Using Facet Grids for Customer Segmentation
10. Creating Synthetic Data for Predicting House Prices
11. Understanding the Impact of Evaluation Metrics on Model Selection
12. Outlier Detection Techniques in Customer Data
13. Visualizing Cross-Validation Folds in Machine Learning
14. Visualizing Clustering Results with Centroids in Python
15. Visualizing Data Drift in Machine Learning Models with Feature Comparison
16. Visualizing the Impact of Feature Engineering on Classification Model Performance
17. Creating a Custom Metric and Visualizing Model Performance
18. Visualizing Class Imbalance in Binary Classification with Python
19. Visualizing Predictions vs Actual Values Using Linear Regression
20. Visualizing Feature Scaling Techniques for Machine Learning
21. Creating a Simple Recommender System Visualization Based on Customer Preferences
22. Evaluating Clustering Performance Using Davies-Bouldin Index
23. Visualizing Data Pipeline Workflow with Machine Learning Integration
24. Visualizing Feature Interactions in a Predictive Model
25. Visualizing the Impact of Regularization Techniques on Overfitting in a Machine Learning Model
26. Ensemble Learning: Visualizing Decision Boundaries in Bagging and Boosting Methods
27. Visualizing ROC AUC for Multi-Class Classification Using Python
28. Visualizing Word Embeddings Using PCA
29. Visualizing Sequential Data using Recurrent Neural Networks (RNNs)
30. Visualizing the Impact of Activation Functions on Neural Network Output
31. Visualizing Forecasting Models Using Linear Regression and Time Series Data
32. Visualizing the Impact of Dropout in a Neural Network for Customer Churn Prediction
33. Visualizing Error Distributions in Machine Learning Models
34. Visualizing Feature Distributions After Scaling Transformation
35. Visualizing Customer Decision-Making with Decision Trees
36. Evaluating Regression Models with Residual Plots to Improve Predictions
37. PCA and Biplots for Customer Data Analysis
38. Visualizing Neural Network Training History Using Matplotlib
39. Building and Visualizing a Multi-Output Regression Model for Predicting House Prices and Rental Rates
40. Comparative Performance of Machine Learning Algorithms on a Sales Prediction Dataset
41. Evaluating Clustering Algorithm Performance on Customer Segmentation
42. Visualizing Decision Thresholds in Binary Classification
43. Visualizing Feature Importances in Decision Tree Models
44. Visualizing Confusion Matrices for Multi-Class Classification Problems
45. Visualizing the Distribution of Customer Product Preferences
46. Visualizing Steps in a Machine Learning Data Pipeline
47. Comparing Model Performance Metrics: A Practical Approach
48. Visualizing the Effect of Noise on Model Performance in Machine Learning
49. Creating a Classifier Using Ensemble Methods with Visualization
50. Visualizing Data Generation for Binary Classification Problems
51. Visualizing Time Series Model Performance Using Python
52. Visualizing Text Classification Results Using a Confusion Matrix
53. Impact of Sample Size on Model Accuracy in Predicting Customer Purchases
54. Visualizing Learning Curves for Machine Learning Models
55. Visualizing the Sensitivity of Machine Learning Model Predictions
56. Visualizing the Impact of Feature Engineering on Classification Performance
57. Creating and Visualizing a Neural Network for Customer Image Classification
58. Visualizing Aggregated Predictions from Multiple Machine Learning Models
59. Visualizing the Relationship Between Features and a Target in a Customer Churn Problem
60. Visualizing Model Training Accuracy and Loss Over Time
61. Visualizing Residuals in a Regression Model for Business Sales Prediction
62. Visualizing K-Means Clustering with Customer Purchase Data
63. Visualizing Time Series Predictions Using Machine Learning
64. Visualizing Model Predictions for House Price Prediction
65. Visualizing Outliers with Boxplots in Machine Learning Data
66. Visualizing Feature Importance in a Customer Churn Prediction Model
67. Visualizing Reinforcement Learning Agent's Training Performance in a Maze
68. Visualizing the Impact of Different Kernel Functions in SVM
69. Visualizing Data Preprocessing in Machine Learning
70. Visualizing the Impact of Hyperparameters on a Decision Tree Classifier
71. Detecting Anomalies in Time Series Data Using Machine Learning
72. Visualizing the Voting Classifier for Customer Churn Prediction
73. Visualizing the Impact of Train-Test Splits in Machine Learning
74. Using Violin Plots to Visualize Categorical Data in Customer Satisfaction
75. Visualizing Feature Selection Process in Machine Learning
76. Visualizing Hyperparameter Optimization for a Customer Classification Model
77. Visualizing Model Performance Across Data Subsets in a Sales Prediction Scenario
78. Visualizing the Impact of Neural Network Layers on Model Accuracy
79. Visualizing the Gradient Boosting Process for Customer Churn Prediction
80. Visualizing the Outputs of a Simple Recurrent Neural Network (RNN) for Time-Series Forecasting
81. Visualizing Data Transformation in Machine Learning
82. Visualizing the Impact of Feature Scaling in Machine Learning
83. Visualizing Cross-Validation Score Distributions in Machine Learning Models
84. Visualizing Algorithmic Complexity in Machine Learning
85. Visualizing the Impact of Hyperparameter Tuning on Model Accuracy
86. Visualizing a Decision Tree for Customer Churn Prediction
87. Visualizing Forecasting Errors Using Machine Learning
88. Visualizing the Results of a Machine Learning Classification Task
Chapter 4 Request for review evaluation
Appendix: Execution Environment
Chapter 1 Introduction
1. Purpose
This book is designed for those who already have a basic understanding of programming and want to dive into Python-based machine learning through hands-on practice.
With 100 targeted exercises, it provides a structured approach to developing and refining your skills. Each exercise includes clear source code and visual output, making it easier to grasp complex concepts.
Detailed explanations accompany every solution, helping you to see not only how the code works but also why it works. Whether you're on your commute or have a few spare moments, simply reading through the exercises can expand your knowledge.
Running the code yourself will deepen your understanding further.
This format is ideal for anyone looking to strengthen their grasp of machine learning by actively working through problems and solutions. Enjoy the journey as you explore Python and machine learning through practical, visual examples.
2. About the Execution Environment for Source Code
For information on the execution environment used for the source code in
this book, please refer to the appendix at the end of the book.
Chapter 2 For beginners
1. Simple Linear Regression with Synthetic Data to Predict Product Pricing
Importance★★★★★
Difficulty★★★☆☆
A company is analyzing the relationship between advertising costs and product prices to determine if there is a correlation that could help optimize pricing strategies. Your task is to build a simple linear regression model to predict product prices based on advertising spending. Using the synthetic data provided in the code, you need to:
Create synthetic data representing advertising costs as the independent variable (X) and product prices as the dependent variable (y).
Fit a simple linear regression model.
Visualize the data points and the linear regression line.
【Data Generation Code Example】
import numpy as np
【Code Answer】
import numpy as np
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
plt.show()
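As a point of reference, the following is a minimal end-to-end sketch of this exercise; the advertising-cost range, the linear coefficients, and the noise level are illustrative assumptions rather than the book's original values.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

np.random.seed(0)
X = np.linspace(10, 100, 50).reshape(-1, 1)              # advertising costs (assumed scale)
y = 20 + 1.5 * X.ravel() + np.random.normal(0, 10, 50)   # product prices with Gaussian noise (assumed)

model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

plt.scatter(X, y, label='Observed prices')
plt.plot(X, y_pred, color='red', label='Fitted regression line')
plt.xlabel('Advertising Cost')
plt.ylabel('Product Price')
plt.legend()
plt.show()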
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
【Diagram Answer】
【Code Answer】
import numpy as np
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
# Initialize and train logistic regression model
model = LogisticRegression()
model.fit(X, y)
# Predict probabilities
y_prob = model.predict_proba(X_test)[:, 1]
plt.ylabel('Probability of Purchase')
plt.show()
np.random.seed(42)
【Code Answer】
import numpy as np
np.random.seed(42)
# K-Means clustering
kmeans = KMeans(n_clusters=3)
clusters = kmeans.fit_predict(X)
# Visualization
plt.legend()
plt.show()
import pandas as pd
data = pd.DataFrame({
})
X = data[['age', 'monthly_charges', 'contract_duration']]
y = data['churn']
【Code Answer】
# # Import necessary libraries
import numpy as np
import pandas as pd
data = pd.DataFrame({
})
y = data['churn']
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
plt.figure(figsize=(12, 8))
plt.show()
import pandas as pd
np.random.seed(42)
n_customers = 1000
#Create DataFrame
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
n_customers = 1000
age = np.random.randint(18, 70, n_customers)
y = data['Churn']
clf.fit(X_train, y_train)
importances = clf.feature_importances_
features = X.columns
plt.figure(figsize=(8, 6))
plt.barh(features, importances, color='skyblue')
plt.xlabel('Feature Importance')
plt.show()
【Code Answer】
import numpy as np
model.fit(X, y)
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = model.decision_function(xy).reshape(XX.shape)
plt.xlabel('Customer Age')
plt.ylabel('Customer Income')
plt.show()
plot_decision_boundary(X, y, model)
X, _ = make_classification(n_samples=100, n_features=5,
n_informative=3, n_classes=3)
【Diagram Answer】
【Code Answer】
import numpy as np
np.random.seed(0)
X, _ = make_classification(n_samples=100, n_features=5,
n_informative=3, n_classes=3)
# # Perform PCA to reduce dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.grid(True)
plt.show()
X, y = make_classification(n_samples=100, n_features=2,
n_informative=2, n_redundant=0, random_state=42)
【Diagram Answer】
【Code Answer】
#import necessary libraries
#create data
X, y = make_classification(n_samples=100, n_features=2,
n_informative=2, n_redundant=0, random_state=42)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.title('Confusion Matrix')
plt.show()
clf_old = LogisticRegression(random_state=42)
clf_old.fit(X_train, y_train)
y_prob_old = clf_old.predict_proba(X_test)[:, 1]
clf_new = LogisticRegression(random_state=24)
clf_new.fit(X_train, y_train)
y_prob_new = clf_new.predict_proba(X_test)[:, 1]
【Diagram Answer】
【Code Answer】
import numpy as np
clf_old = LogisticRegression(random_state=42)
clf_old.fit(X_train, y_train)
y_prob_old = clf_old.predict_proba(X_test)[:, 1]
clf_new = LogisticRegression(random_state=24)
clf_new.fit(X_train, y_train)
y_prob_new = clf_new.predict_proba(X_test)[:, 1]
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc='lower right')
plt.show()
import pandas as pd
【Code Answer】
import numpy as np
import pandas as pd
This task involves generating random time-series data for monthly sales and visualizing it with a line plot. First, numpy is used to create an array of 12 months and to randomly generate sales data for each month between 100 and 1000. This synthetic data mimics real-world sales trends. We then use the matplotlib library to plot the data. The plot() function creates the line chart, where months are plotted on the x-axis and sales on the y-axis. The marker='o' option adds circles at each data point, making it easier to see individual values. The label argument provides a label for the data line, which is later shown using plt.legend(). Additionally, the xlabel() and ylabel() functions add meaningful axis labels to the chart, and the title() function adds a title. The xticks() call ensures that all months from 1 to 12 are displayed on the x-axis, and grid(True) adds a grid to the plot, making it easier to interpret the data visually. Finally, show() is used to display the resulting chart. In machine learning, such visualizations are crucial for understanding the underlying patterns in time-series data before model training. Trends, seasonal variations, and outliers can be identified easily with such plots. This step often precedes more complex analyses, like training predictive models to forecast future sales.
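The steps described above can be condensed into a short sketch; the random seed and the exact sales values drawn are assumptions chosen only to match the description.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
months = np.arange(1, 13)                         # 12 months
sales = np.random.randint(100, 1000, size=12)     # synthetic monthly sales between 100 and 1000

plt.plot(months, sales, marker='o', label='Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Monthly Sales Trend')
plt.xticks(months)                                # show every month on the x-axis
plt.grid(True)
plt.legend()
plt.show()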
【Trivia】
Line plots are commonly used in time-series analysis because they
effectively visualize changes over continuous intervals.
11. Heatmap of Correlation Matrix for Customer Sales Data Analysis
Importance★★★★☆
Difficulty★★★☆☆
A retail company wants to understand the relationships between various sales metrics such as total sales, customer visits, marketing spend, and advertising channels. You are tasked with analyzing the correlations between these metrics to help the company optimize its marketing strategies. Create a heatmap to visualize the correlation matrix of these metrics using Python, and provide insights into how different variables are related to one another. The dataset should include the following columns:
Total_Sales: Total revenue generated from sales.
Customer_Visits: Number of customers visiting the store.
Marketing_Spend: Amount of money spent on marketing campaigns.
Online_Ads: Spend on online advertising channels.
TV_Ads: Spend on TV advertising channels.
Analyze this dataset and generate a correlation heatmap.
【Data Generation Code Example】
import pandas as pd
import numpy as np
data = {
df = pd.DataFrame(data)
【Diagram Answer】
【Code Answer】
# # Importing necessary libraries
import pandas as pd
import numpy as np
df = pd.DataFrame(data)
plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm",
linewidths=0.5)
plt.show()
import pandas as pd
# # Generate random data for age, spending score, and income level
age = np.random.randint(18, 70, 500)
# # Create a dataframe
【Code Answer】
import numpy as np
import pandas as pd
# # Generate random data for age, spending score, and income level
# # Create a dataframe
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.hist(df['Age'], bins=15, color='skyblue', edgecolor='black')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.subplot(1, 3, 2)
plt.xlabel('Spending Score')
plt.ylabel('Frequency')
plt.subplot(1, 3, 3)
plt.hist(df['IncomeLevel'], bins=15, color='salmon', edgecolor='black')
plt.xlabel('Income Level')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
In this exercise, we are visualizing the distribution of features for a dataset containing customer information. This process helps in analyzing the underlying patterns in the data before applying any machine learning model. Feature distribution is important because it can reveal trends, outliers, and possible biases within the data, which can directly impact model performance. Here, we create random data for three customer features: "age," "spending score," and "income level." The data is generated using the np.random.randint function to simulate realistic values for these features. Once the dataset is generated, we use histograms to visualize each feature. Histograms are useful because they show how often data points fall into specific ranges, which can be used to identify common values and the spread of the data. We use Matplotlib's plt.hist function to generate each histogram. The bins parameter determines the number of bars (bins) that the histogram will display, while the color and edgecolor arguments customize the appearance. The titles and labels make it easier to interpret each plot, and plt.tight_layout() ensures that the plots do not overlap. Finally, plt.show() renders the visualizations. Understanding how features are distributed helps you preprocess data more effectively, for example by normalizing features that are skewed or removing outliers that could distort model performance.
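A compact sketch of the approach described above; the value ranges for the spending score and income level are assumptions, since only the age range appears in the extracted code.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)
df = pd.DataFrame({
    'Age': np.random.randint(18, 70, 500),
    'SpendingScore': np.random.randint(1, 101, 500),        # assumed 1-100 scale
    'IncomeLevel': np.random.randint(20000, 120000, 500),   # assumed income range
})

plt.figure(figsize=(15, 5))
for i, col in enumerate(df.columns, start=1):
    plt.subplot(1, 3, i)
    plt.hist(df[col], bins=15, edgecolor='black')
    plt.title(f'{col} Distribution')
    plt.xlabel(col)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()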
【Trivia】
Histograms are one of the oldest statistical tools, and they were first
introduced by Karl Pearson in the late 19th century to represent frequency
distributions.
13. Box Plot Outlier Detection in Customer Sales Data
Importance★★★★☆
Difficulty★★★☆☆
A retail company wants to analyze its sales data to detect potential outliers that may affect decision-making. Outliers can indicate fraud, data entry errors, or special events that may need further investigation. The sales data includes 1000 records representing daily sales in dollars for a specific product. Your task is to generate the data, create a box plot for visual outlier detection, and use Python to detect and visualize the outliers. Please write code that generates a box plot, identifies the outliers in the sales data, and displays the results.
【Data Generation Code Example】
import numpy as np
import pandas as pd
【Code Answer】
import numpy as np
import pandas as pd
plt.ylabel("Sales in Dollars")
plt.show()
Q1 = df["Daily Sales"].quantile(0.25)
Q3 = df["Daily Sales"].quantile(0.75)
IQR = Q3 - Q1
In this task, we are detecting outliers in sales data using a box plot, a
graphical tool often used in machine learning and data analysis. First, we
generate synthetic data that mimics real-world sales figures. We simulate
daily sales using a normal distribution with a mean of 1000 dollars and a
standard deviation of 250. We also intentionally add a few extreme values
(5000, 5500, and 6000 dollars) to serve as outliers for this analysis. This
type of data generation helps us create a realistic environment where
outliers are present.
Next, a box plot is drawn to visualize the spread of the data. A box plot
shows the median, quartiles (Q1, Q3), and any points that are far away from
the main body of data (potential outliers). The whiskers in a box plot
typically extend to 1.5 times the interquartile range (IQR), and points
outside this range are considered outliers.
We compute Q1 (the 25th percentile) and Q3 (the 75th percentile) to
calculate the IQR (Q3 - Q1). Any sales figures that fall below Q1 - 1.5 *
IQR or above Q3 + 1.5 * IQR are considered outliers. These are printed
after detection for further analysis.
This approach is commonly used in machine learning to clean and
preprocess data before feeding it into models. Detecting and handling
outliers is critical to avoid misleading results, and the box plot is a simple
yet powerful tool for this purpose.
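The procedure above can be sketched as follows; the sample size and seed are assumptions, while the mean, standard deviation, injected outliers, and IQR rule follow the explanation.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)
sales = np.random.normal(1000, 250, 1000)                 # daily sales centered around $1000
sales = np.append(sales, [5000, 5500, 6000])              # intentionally injected outliers
df = pd.DataFrame({'Daily Sales': sales})

plt.boxplot(df['Daily Sales'])
plt.ylabel('Sales in Dollars')
plt.show()

Q1 = df['Daily Sales'].quantile(0.25)
Q3 = df['Daily Sales'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Daily Sales'] < Q1 - 1.5 * IQR) | (df['Daily Sales'] > Q3 + 1.5 * IQR)]
print(outliers)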
【Trivia】
Did you know that John Tukey, an American mathematician, invented the
box plot in 1977? It's a key tool in exploratory data analysis and is
especially useful for highlighting outliers in datasets.
14. Data Normalization and Visual Comparison for Machine Learning
Importance★★★★★
Difficulty★★★☆☆
You are working with a retail company that is analyzing customer purchase patterns. The company wants to build a machine learning model to predict future customer spending. However, the raw data contains features with different scales, such as the number of items purchased (a small number) and the total spending amount (a large number). The company needs your help to normalize the data for better machine learning model performance and to visually compare the impact of normalization. Write a Python script that:
Generates synthetic data for customer purchases, including the number of items purchased and the total spending amount.
Normalizes the data using MinMaxScaler and StandardScaler.
Plots both the original data and the normalized data for visual comparison.
【Data Generation Code Example】
import numpy as np
【Code Answer】
# Import necessary libraries
import numpy as np
np.random.seed(42)
min_max_scaler = MinMaxScaler()
standard_scaler = StandardScaler()
data_standard_scaled = standard_scaler.fit_transform(data)
# Plot original and normalized data
plt.figure(figsize=(12, 6))
plt.subplot(1, 3, 1)
plt.title('Original Data')
plt.xlabel('Items Purchased')
plt.subplot(1, 3, 2)
plt.subplot(1, 3, 3)
plt.tight_layout()
plt.show()
import pandas as pd
np.random.seed(42)
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
plt.scatter(data['Income_Age_Interaction'], data['Spending_Score'])
plt.xlabel('Income-Age Interaction')
plt.ylabel('Spending Score')
plt.show()
【Code Answer】
import numpy as np
plt.figure(figsize=(10, 6))
plt.colorbar(label='Customer Cluster')
plt.show()
np.random.seed(0)
## Generate random house sizes between 500 and 3500 square feet
【Code Answer】
import numpy as np
np.random.seed(0)
## Generate random house sizes between 500 and 3500 square feet
sizes = sizes.reshape(-1, 1)
poly = PolynomialFeatures(degree=2)
sizes_poly = poly.fit_transform(sizes)
plt.ylabel('Price (USD)')
plt.legend()
plt.show()
np.random.seed(42)
【Code Answer】
import numpy as np
# # Train-test split #
model = Sequential([
Dense(32, activation='relu'),
])
# # Compile the model using the custom loss function #
model.compile(optimizer='adam', loss=custom_loss)
# # Train the model #
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
X, y = make_classification(n_samples=100, n_features=5,
n_informative=3, n_classes=2, random_state=42)
【Diagram Answer】
【Code Answer】
import numpy as np
X, y = make_classification(n_samples=100, n_features=5,
n_informative=3, n_classes=2, random_state=42)
# Initialize the decision tree classifier
clf = DecisionTreeClassifier()
plt.xlabel("Fold Number")
plt.ylabel("Accuracy Score")
plt.ylim(0.0, 1.0)
plt.grid(True)
plt.show()
X, y = make_classification(n_samples=100, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1,
random_state=42)
【Diagram Answer】
【Code Answer】
import numpy as np
X, y = make_classification(n_samples=100, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1,
random_state=42)
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, idx + 1)
clf.fit(X, y)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.xlabel('Time Spent')
plt.ylabel('Clicks')
plt.tight_layout()
plt.show()
y = 2 * X + 1 + np.random.randn(100) * 2
【Diagram Answer】
【Code Answer】
import numpy as np
theta0 = 0
theta1 = 0
learning_rate = 0.1
iterations = 50
cost_history = []
theta_history = []
for i in range(iterations):
cost_history.append(cost)
theta_history.append((theta0, theta1))
theta_history = np.array(theta_history)
plt.figure(figsize=(10, 5))
plt.xlabel('Iterations')
plt.ylabel('Cost (MSE)')
plt.legend()
plt.subplot(1, 2, 2)
plt.xlabel('Iterations')
plt.ylabel('theta1')
plt.legend()
plt.tight_layout()
plt.show()
【Code Answer】
import numpy as np
import matplotlib.cm as cm
labels = kmeans.fit_predict(data)
y_lower = 10
for i in range(3):
ith_cluster_silhouette_values = sample_silhouette_values[labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
color = cm.nipy_spectral(float(i) / 3)
ax1.fill_betweenx(np.arange(y_lower, y_upper), 0,
ith_cluster_silhouette_values, facecolor=color, edgecolor=color,
alpha=0.7)
ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
y_lower = y_upper + 10
ax1.set_ylabel("Cluster label")
ax1.set_yticks([])
colors = cm.nipy_spectral(labels.astype(float) / 3)
centers = kmeans.cluster_centers_
for i, c in enumerate(centers):
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
plt.show()
This task involves clustering customer data using KMeans and evaluating the clustering quality through the Silhouette Score. We first generated synthetic data representing customer purchasing patterns using the make_blobs function, which creates clusters of data points with specific centers and variations. KMeans is a popular clustering algorithm that groups data points by minimizing the distance between the points and the cluster centroids. In this case, we specified that the data should be grouped into 3 clusters. The fit_predict function was used to assign each data point to one of the clusters based on the KMeans algorithm. To evaluate the quality of the clustering, we calculated the Silhouette Score. The Silhouette Score measures how similar a data point is to its own cluster compared to other clusters. A higher score indicates that the point is well matched to its cluster, while a score close to 0 or negative suggests that it is near the boundary between clusters. The visualization consisted of two parts:
‣ The first plot visualizes the silhouette coefficients for each data point, allowing us to assess the consistency of each cluster.
‣ The second plot shows the clustered data in two dimensions, with each point color-coded according to its assigned cluster and the centroids marked.
This visualization helps us understand how well the clusters are formed and whether they overlap. The red dashed line in the silhouette plot represents the average silhouette score, providing a global measure of clustering quality.
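A condensed sketch of this workflow; the make_blobs parameters and the plotting style are assumptions that follow the explanation rather than the book's exact code.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)   # assumed data shape
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

avg_score = silhouette_score(X, labels)
sample_scores = silhouette_samples(X, labels)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
y_lower = 10
for i in range(3):
    vals = np.sort(sample_scores[labels == i])
    y_upper = y_lower + vals.shape[0]
    ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, vals, alpha=0.7)   # silhouette band per cluster
    y_lower = y_upper + 10
ax1.axvline(avg_score, color='red', linestyle='--')                      # average silhouette score
ax1.set_xlabel('Silhouette coefficient')
ax1.set_ylabel('Cluster label')

ax2.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
ax2.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x', s=100)
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')
plt.show()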
【Trivia】
The Silhouette Score ranges from -1 to 1, where a score close to 1 indicates
that clusters are well-separated, and a score close to -1 suggests that data
points may be wrongly clustered.
23. Comparison of Min-Max Scaling and Standardization in Machine Learning
Importance★★★★☆
Difficulty★★★☆☆
A retail company has a dataset containing the weekly sales and the number of customers for each of their stores.
They want to analyze this data using machine learning models to predict sales, but they are unsure about the best scaling method to use for preprocessing.
Your task is to:
Generate a synthetic dataset representing weekly sales (in thousands of dollars) and the number of customers.
Apply both Min-Max Scaling and Standardization to the dataset.
Plot the scaled data for each feature under both scaling methods.
By visualizing this, the company hopes to understand how these scaling techniques affect the distribution of the data and which might be better for predictive modeling.
import pandas as pd
【Code Answer】
import numpy as np
import pandas as pd
df = pd.DataFrame(data)
## Apply Min-Max Scaling
min_max_scaler = MinMaxScaler()
min_max_scaled = min_max_scaler.fit_transform(df)
## Apply Standardization
standard_scaler = StandardScaler()
standard_scaled = standard_scaler.fit_transform(df)
## Plot the scaled data
ax[0].set_title('Min-Max Scaling')
ax[0].set_xlabel('Sales')
ax[0].set_ylabel('Customers')
ax[0].legend()
ax[1].set_title('Standardization')
ax[1].set_xlabel('Sales')
ax[1].set_ylabel('Customers')
ax[1].legend()
plt.tight_layout()
plt.show()
import pandas as pd
np.random.seed(0)
n_samples = 200
X = np.random.rand(n_samples, 3) * 100
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(0)
X = np.random.rand(200, 3) * 100
train_errors = [mean_squared_error(y_train[:m],
LinearRegression().fit(X_train[:m], y_train[:m]).predict(X_train[:m])) for
m in range(1, len(y_train)+1)]
val_errors = [mean_squared_error(y_val,
LinearRegression().fit(X_train[:m], y_train[:m]).predict(X_val)) for m in
range(1, len(y_train)+1)]
plt.legend()
plt.show()
【Code Answer】
import matplotlib.pyplot as plt
counter_before = Counter(y)
plt.bar(counter_before.keys(), counter_before.values())
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()
plt.bar(counter_after.keys(), counter_after.values())
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()
In this task, we aim to visualize class imbalance before and after applying SMOTE (Synthetic Minority Over-sampling Technique). This technique is commonly used in classification problems where one class is underrepresented compared to another, causing skewed model training. We first generate synthetic data using the make_classification function from sklearn.datasets, which lets us specify an imbalanced class distribution through the weights parameter. Here, we set 90% of the samples to belong to class 0 (non-churn) and only 10% to class 1 (churn), creating a highly imbalanced dataset. To see the effect of this imbalance, we use the Counter class from Python's collections module to count the instances of each class and visualize the distribution with a simple bar chart using matplotlib. This initial plot clearly shows the class imbalance in the target variable. Next, we apply SMOTE, which generates synthetic examples of the minority class (in this case, class 1) to balance the dataset. SMOTE operates by creating new synthetic examples between existing minority class examples, preserving the feature space structure. After applying SMOTE, we again use Counter to count the new, balanced class distribution and plot the result. The second bar chart shows that the class distribution has been equalized, with both classes now having similar frequencies, which helps in building a more effective classification model. SMOTE is particularly useful in machine learning models where class imbalance can lead to biased predictions and poor performance on the minority class. By balancing the dataset, the model can learn to predict both classes more accurately.
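A minimal sketch of the before/after comparison described above; it assumes the imbalanced-learn package is installed, and the feature count is illustrative.

import matplotlib.pyplot as plt
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE   # provided by the imbalanced-learn package

X, y = make_classification(n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=42)

counter_before = Counter(y)
plt.bar(list(counter_before.keys()), list(counter_before.values()))
plt.title('Class Distribution Before SMOTE')
plt.xlabel('Class')
plt.ylabel('Frequency')
plt.show()

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
counter_after = Counter(y_res)
plt.bar(list(counter_after.keys()), list(counter_after.values()))
plt.title('Class Distribution After SMOTE')
plt.xlabel('Class')
plt.ylabel('Frequency')
plt.show()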
【Trivia】
Did you know that SMOTE isn't the only technique to handle imbalanced datasets? Other methods include undersampling the majority class or using advanced ensemble techniques like Random Forest with class weights. Each method has its pros and cons, depending on the dataset size and the nature of the problem you're solving!
26. Visualizing Feature Importance in Machine Learning Models
Importance★★★★☆
Difficulty★★★☆☆
Imagine that you are working as a data analyst for a retail company, and your task is to help improve customer retention by identifying key factors that influence customer spending. You have data on various customer attributes, such as age, income, number of visits, and previous purchases. Your goal is to determine which factors are the most important in predicting customer spending to better target marketing efforts. Your task is to:
Generate a sample dataset that includes customer spending and associated attributes.
Train a decision tree regressor model on this dataset to predict customer spending.
Visualize the feature importance using a bar chart to showcase which attributes have the most impact on customer spending.
You need to write code to generate the data, train the model, and create a visualization of the feature importance. The visualization should clearly show the ranking of each feature based on its importance in predicting spending.
【Data Generation Code Example】
import numpy as np
import pandas as pd
np.random.seed(42)
data = pd.DataFrame({
})
【Diagram Answer】
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
data = pd.DataFrame({
})
X = data.drop('Spending', axis=1)
y = data['Spending']
model = DecisionTreeRegressor()
model.fit(X, y)
features = X.columns
plt.barh(features, feature_importance)
plt.xlabel('Importance')
plt.show()
X, y = make_classification(n_samples=1000, n_features=20,
n_informative=15, n_redundant=5, random_state=42)
## Split dataset
【Code Answer】
import numpy as np
X, y = make_classification(n_samples=1000, n_features=20,
n_informative=15, n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
plt.xlabel('Training Size')
plt.ylabel('Accuracy')
plt.legend()
plt.grid()
plt.show()
【Code Answer】
import numpy as np
labels = dbscan.fit_predict(data_scaled)
unique_labels = set(labels)
class_member_mask = labels == k
xy = data[class_member_mask]
plt.ylabel('Number of Purchases')
plt.show()
【Code Answer】
import numpy as np
plt.grid(True)
plt.show()
y = 3 * X ** 2 + 5 * X + np.random.normal(0, 5, size=100)
【Code Answer】
import numpy as np
y = 3 * X ** 2 + 5 * X + np.random.normal(0, 5, size=100)
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
degrees = [1, 3, 9]
train_errors = []
test_errors = []
poly = PolynomialFeatures(degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.fit_transform(X_test)
model = LinearRegression()
model.fit(X_train_poly, y_train)
y_train_pred = model.predict(X_train_poly)
y_test_pred = model.predict(X_test_poly)
train_errors.append(mean_squared_error(y_train, y_train_pred))
test_errors.append(mean_squared_error(y_test, y_test_pred))
plt.figure()
plt.title('Bias-Variance Tradeoff')
plt.legend()
plt.show()
np.random.seed(42)
【Code Answer】
import numpy as np
np.random.seed(42)
datagen = ImageDataGenerator(rotation_range=30,
width_shift_range=0.2, height_shift_range=0.2, zoom_range=0.2,
horizontal_flip=True)
for i in range(10):
## Original images
plt.subplot(2, 10, i + 1)
plt.axis('off')
## Augmented images
plt.axis('off')
plt.show()
import pandas as pd
np.random.seed(42)
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
sns.pairplot(data)
X = data
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
print(vif_data)
The objective of this task is to help users understand multicollinearity using the Variance Inflation Factor (VIF). Multicollinearity can occur when two or more predictors in a regression model are highly correlated, and high multicollinearity can cause problems in model interpretation and prediction accuracy. In this exercise, the dataset consists of three advertising channels (TV, Radio, and Newspaper), and the goal is to visualize their relationships using a scatter matrix and compute the VIF to detect any multicollinearity. The pairplot from the Seaborn library helps visualize the relationships between the advertising channels. If the scatter plot between any two features shows a clear linear pattern, it suggests that those variables are correlated, which could lead to multicollinearity. The Variance Inflation Factor (VIF) quantifies how much the variance of a regression coefficient is inflated due to multicollinearity. A VIF value above 5–10 is considered problematic, indicating that the feature is highly correlated with others in the dataset and might need to be removed or adjusted. In the provided Python code, the variance_inflation_factor function is used to calculate the VIF for each feature (TV, Radio, Newspaper). The output shows which feature contributes most to multicollinearity. If a feature's VIF is high, the marketing team may need to reconsider how that channel is measured or its importance in the regression model. This exercise is a useful way to illustrate one of the core challenges in regression analysis: multicollinearity and its detection. It is especially valuable in marketing, where different ad channels often influence each other.
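A sketch of the VIF computation described above; the way Radio is made to correlate with TV is an assumption introduced only so the VIF values have something to detect.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor

np.random.seed(42)
tv = np.random.normal(100, 20, 200)
radio = 0.5 * tv + np.random.normal(0, 5, 200)      # deliberately correlated with TV (assumption)
newspaper = np.random.normal(30, 10, 200)
data = pd.DataFrame({'TV': tv, 'Radio': radio, 'Newspaper': newspaper})

sns.pairplot(data)                                  # scatter matrix of the three channels
plt.show()

X = data
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)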
【Trivia】
The concept of multicollinearity in regression was first introduced in 1934
by Ragnar Frisch, a Nobel Prize-winning economist.
6. Feature Selection Techniques and Their Impact on Model Performance
Importance★★★★★
Difficulty★★★★☆
You are working for a healthcare startup that wants to predict whether a patient will develop diabetes based on various features such as age, BMI, and blood pressure. Your task is to use feature selection techniques to identify which features are most important for predicting diabetes. Then, visualize the effect of these selected features on model performance using a classifier (e.g., Logistic Regression). Please generate synthetic data using the following features:
Age
BMI (Body Mass Index)
Blood Pressure
Insulin Level
Glucose Level
After generating the data, perform feature selection using any technique of your choice (e.g., Recursive Feature Elimination (RFE), SelectKBest, etc.), and compare the model performance with and without feature selection by plotting the results.
【Data Generation Code Example】
import numpy as np
import pandas as pd
X, y = make_classification(n_samples=500, n_features=5,
n_informative=3, n_classes=2, random_state=42)
df['Diabetes'] = y
【Diagram Answer】
【Code Answer】
import numpy as np
import pandas as pd
X, y = make_classification(n_samples=500, n_features=5,
n_informative=3, n_classes=2, random_state=42)
df = pd.DataFrame(X, columns=features)
df['Diabetes'] = y
model = LogisticRegression()
selected_features = np.array(features)[selector.support_]
model.fit(X_train, y_train)
y_pred_all = model.predict(X_test)
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
model.fit(X_train_selected, y_train)
y_pred_selected = model.predict(X_test_selected)
plt.ylabel('Accuracy')
plt.show()
import pandas as pd
np.random.seed(0)
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(0)
sales_df.set_index('Date', inplace=True)
result.plot()
plt.show()
import pandas as pd
np.random.seed(42)
ratings = ratings.stack().reset_index()
ratings.columns = ['User', 'Movie', 'Rating']
【Code Answer】
import numpy as np
import pandas as pd
ratings = ratings.stack().reset_index()
svd = TruncatedSVD(n_components=2)
user_factors = svd.fit_transform(utility_matrix)
movie_factors = svd.components_
predictions = [np.dot(user_factors[utility_matrix.index.get_loc(u)],
movie_factors[:, utility_matrix.columns.get_loc(m)])
## Calculating RMSE
print(f'RMSE: {rmse:.2f}')
## Visualizing predicted vs actual ratings
plt.scatter(test_data['Rating'], predictions)
plt.xlabel('Actual Ratings')
plt.ylabel('Predicted Ratings')
plt.show()
import pandas as pd
np.random.seed(42)
ages = np.random.randint(18, 66, 200)
【Code Answer】
import numpy as np
import pandas as pd
sns.set(style="whitegrid")
g.add_legend()
plt.subplots_adjust(top=0.85)
plt.show()
import pandas as pd
np.random.seed(42)
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
size = np.random.randint(800, 4000, 500) # house size in sq. ft
y = df['price']
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
plt.scatter(y_test, y_pred)
plt.ylabel('Predicted Prices')
plt.show()
This exercise demonstrates how to use synthetic data for a machine learning task. Synthetic data is useful for testing models when real data isn't available or is limited. First, we generate features like size, bedrooms, and distance_from_city. These features are created using NumPy's random functions to simulate real-world values. The target variable, price, is calculated using a simple linear formula with some added noise to simulate realistic data. The relationship between the features and price includes positive correlations (more size and bedrooms increase price) and negative correlations (greater distance from the city reduces price). After generating the data, we split it into training and testing sets using train_test_split. This ensures the model is trained on one portion of the data and tested on another, preventing overfitting. Next, we use a LinearRegression model, which fits a linear equation to the data. The model learns the coefficients of the equation from the training data. Once trained, the model predicts house prices based on the test data. We then plot the predicted prices against the actual prices. The 45-degree red line on the plot represents perfect predictions, and the closer the points are to this line, the better the model's performance. Finally, this plot provides a visual assessment of how well the model generalizes to unseen data, helping to understand the quality of the model's predictions. This exercise highlights key steps in machine learning: data generation, model training, and result visualization.
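The full flow can be sketched as follows; the coefficients in the price formula and the sample size are assumptions chosen to match the description.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

np.random.seed(42)
n = 500
size = np.random.randint(800, 4000, n)
bedrooms = np.random.randint(1, 6, n)
distance = np.random.uniform(1, 30, n)
price = 50000 + 150 * size + 10000 * bedrooms - 2000 * distance + np.random.normal(0, 20000, n)
df = pd.DataFrame({'size': size, 'bedrooms': bedrooms, 'distance_from_city': distance, 'price': price})

X = df[['size', 'bedrooms', 'distance_from_city']]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

plt.scatter(y_test, y_pred)
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, color='red')    # 45-degree reference line: perfect predictions
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.show()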
【Trivia】
Did you know that synthetic data generation is often used in fields like
autonomous driving? Self-driving car companies frequently generate
synthetic data to simulate various driving conditions that are too rare or
dangerous to collect in the real world.
11. Understanding the Impact of Evaluation Metrics on Model Selection
Importance★★★★☆
Difficulty★★★☆☆
Imagine you are working for an e-commerce company that wants to improve its product recommendation system. You have been tasked with evaluating two machine learning models to predict user purchases based on user data. However, you are unsure which evaluation metric (accuracy, precision, or recall) should guide your decision. Your goal is to visualize how different evaluation metrics impact model selection. For simplicity, generate synthetic classification data and compare two models: a logistic regression model and a decision tree classifier. Create a plot that shows how these models perform using the three different metrics.
【Data Generation Code Example】
from sklearn.datasets import make_classification
【Code Answer】
import matplotlib.pyplot as plt
model1 = LogisticRegression()
model2 = DecisionTreeClassifier()
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)
# Making predictions
y_pred1 = model1.predict(X_test)
y_pred2 = model2.predict(X_test)
# Calculating metrics
plt.figure(figsize=(8, 6))
plt.plot(metrics, logistic_scores, label='Logistic Regression', marker='o')
plt.ylabel('Score')
plt.legend()
plt.grid(True)
plt.show()
import pandas as pd
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
clf = IsolationForest(contamination=0.05)
inliers = df[df['outlier'] == 1]
plt.xlabel('Customer Spending')
plt.ylabel('Number of Transactions')
plt.legend()
plt.show()
import numpy as np
iris = load_iris()
X = iris.data
y = iris.target
【Code Answer】
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X = iris.data
y = iris.target
fold_number = 1
plt.figure()
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
fold_number += 1
【Code Answer】
import numpy as np
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
centroids = kmeans.cluster_centers_
plt.show()
import pandas as pd
old_data = pd.DataFrame({
})
})
【Diagram Answer】
【Code Answer】
import numpy as np
import pandas as pd
old_data = pd.DataFrame({
})
new_data = pd.DataFrame({
axes[0].set_title('Price Comparison')
axes[0].legend()
axes[1].set_title('Stock Comparison')
axes[1].legend()
plt.tight_layout()
plt.show()
In this exercise, you are visualizing data drift between two datasets to detect
any changes in feature distributions. Data drift refers to changes in data
patterns over time, which can reduce the performance of machine learning
models. Visualizing these changes is an important step in maintaining
model accuracy.
The two datasets are created to represent old data used for model training
and new data with slight changes in distributions. The 'price', 'stock', and
'sales_trend' features are generated using random normal distributions. In
the new data, the mean and standard deviation of the features are slightly
shifted to simulate real-world drift.
Histograms are used to visualize the distributions of the features in the old
and new datasets. The goal is to detect whether there are significant shifts in
these distributions. For each feature ('price', 'stock', and 'sales_trend'), you
can compare the overlap between the old and new data histograms to
visually inspect any drift. Large shifts in distributions indicate potential data
drift, which might require model retraining or adjustments to maintain
prediction accuracy.
This approach helps you understand how data drift affects your model and
provides a visual tool to monitor the health of your machine learning
system. Monitoring for data drift regularly is crucial for maintaining the
performance and reliability of your models in a changing data environment.
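A sketch of the comparison described above; the means and standard deviations of the 'price', 'stock', and 'sales_trend' features, and the size of the shift in the new data, are assumed values.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)
old_data = pd.DataFrame({
    'price': np.random.normal(100, 10, 1000),
    'stock': np.random.normal(50, 5, 1000),
    'sales_trend': np.random.normal(200, 20, 1000),
})
new_data = pd.DataFrame({
    'price': np.random.normal(110, 12, 1000),        # shifted mean and spread (assumed drift)
    'stock': np.random.normal(48, 6, 1000),
    'sales_trend': np.random.normal(210, 25, 1000),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, old_data.columns):
    ax.hist(old_data[col], bins=30, alpha=0.5, label='Old data')
    ax.hist(new_data[col], bins=30, alpha=0.5, label='New data')
    ax.set_title(f'{col} comparison')
    ax.legend()
plt.tight_layout()
plt.show()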
【Trivia】
Data drift is also known as "covariate shift" or "concept drift" in the
machine learning community. Models trained on old data often lose
accuracy when deployed in a production environment where the underlying
data distribution changes over time. Data drift detection tools help mitigate
this risk by alerting when the model performance may degrade due to
changing data.
16. Visualizing the Impact of Feature Engineering on Classification Model Performance
Importance★★★★☆
Difficulty★★★☆☆
A retail company is trying to improve its customer segmentation model, which classifies customers based on their spending habits. Currently, they have basic data on customer demographics and purchase frequency. However, they believe that creating new features such as average transaction value or total purchases per month may improve the model's performance. Your task is to create a sample dataset, engineer relevant features, and compare the performance of a classification model before and after feature engineering. Output a graph that shows the accuracy of the model with and without the newly engineered features.
【Data Generation Code Example】
import numpy as np
import pandas as pd
X, y = make_classification(n_samples=500, n_features=5,
random_state=42)
df['target'] = y
【Code Answer】
import numpy as np
import pandas as pd
df['target'] = y
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_after = accuracy_score(y_test, y_pred)
plt.ylabel('Accuracy')
plt.show()
np.random.seed(0)
【Code Answer】
import numpy as np
np.random.seed(0)
mean_relative_error = np.mean(relative_error)
plt.figure(figsize=(10,6))
plt.xlabel('Day')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.show()
In this task, you are asked to predict future sales and measure model performance using a custom metric. First, the data generation code produces two sets of sales data: actual sales and predicted sales. These values are simulated using random numbers to mimic real-world sales data. We then create a custom metric called "relative error," which measures the difference between the actual and predicted sales as a percentage of the actual sales. This metric helps us understand how far off the predictions are from the true values. The relative error for each day is calculated as the absolute difference between the actual and predicted values, divided by the actual value. We then compute the mean relative error to get an overall picture of model accuracy. After computing the metric, we plot the sales data using matplotlib. The plt.plot() function is used to create a line plot, where the actual sales and predicted sales are drawn with different markers. The plt.legend() function labels the lines, and the plot is enhanced with gridlines using plt.grid(True). This visualization allows you to clearly see how well the model performs by comparing the predicted and actual sales.
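A compact sketch of the custom metric and plot; the sales ranges and the 30-day horizon are illustrative assumptions.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
days = np.arange(1, 31)
actual_sales = np.random.randint(200, 500, 30)
predicted_sales = actual_sales + np.random.randint(-50, 50, 30)   # simulated model output

relative_error = np.abs(actual_sales - predicted_sales) / actual_sales
mean_relative_error = np.mean(relative_error)
print(f'Mean relative error: {mean_relative_error:.2%}')

plt.figure(figsize=(10, 6))
plt.plot(days, actual_sales, marker='o', label='Actual Sales')
plt.plot(days, predicted_sales, marker='x', label='Predicted Sales')
plt.xlabel('Day')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.show()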
【Trivia】
The relative error metric is widely used when evaluating model
performance because it is scale-independent, meaning it is useful whether
you are predicting small or large numbers.
18. Visualizing Class Imbalance in Binary Classification with Python
Importance★★★★☆
Difficulty★★★☆☆
You are working with a credit card company that is dealing with fraudulent transactions. The company has provided you with a dataset where fraudulent transactions make up a very small portion of the total data, leading to an imbalanced dataset. Your task is to visualize the imbalance of the dataset using a bar chart and then fit a simple machine learning model to classify fraudulent and non-fraudulent transactions. Create a synthetic dataset representing transactions, where the class '1' represents fraud and the class '0' represents non-fraud. Generate 1000 data points where only 5% of the transactions are fraudulent. After visualizing the imbalance, fit a logistic regression model to this dataset, and calculate the accuracy of the model.
【Data Generation Code Example】
import numpy as np
import pandas as pd
n_samples = 1000
fraud_ratio = 0.05
data = np.random.rand(n_samples, 2)
labels = np.array([1 if i < int(n_samples * fraud_ratio) else 0 for i in
range(n_samples)])
【Diagram Answer】
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
n_samples = 1000
fraud_ratio = 0.05
data = np.random.rand(n_samples, 2)
plt.ylabel('Number of Transactions')
plt.show()
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Calculate accuracy
import pandas as pd
np.random.seed(42)
X = np.random.rand(100, 3) * 100 # Square footage, number of
bedrooms, age
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
X = np.random.rand(100, 3) * 100 # Square footage, number of
bedrooms, age
# # Split data
model.fit(X_train, y_train)
# # Predict
y_pred = model.predict(X_test)
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.show()
import pandas as pd
【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
scaler_minmax = MinMaxScaler()
scaler_standard = StandardScaler()
data_minmax = scaler_minmax.fit_transform(data)
data_standard = scaler_standard.fit_transform(data)
plt.figure(figsize=(10, 6))
plt.subplot(1, 3, 1)
plt.title('Original Data')
plt.xlabel('Height')
plt.ylabel('Weight')
plt.legend()
plt.subplot(1, 3, 2)
plt.xlabel('Height')
plt.ylabel('Weight')
plt.legend()
plt.subplot(1, 3, 3)
plt.xlabel('Height')
plt.ylabel('Weight')
plt.legend()
plt.tight_layout()
plt.show()
import numpy as np
np.random.seed(0)
#Create DataFrame
df
【Diagram Answer】
【Code Answer】
import pandas as pd
import numpy as np
np.random.seed(0)
similarity_matrix = cosine_similarity(df)
target_customer = 'Customer_1'
similar_customers =
similarity_df[target_customer].sort_values(ascending=False)
similar_customers
【Code Answer】
import numpy as np
kmeans = KMeans(n_clusters=4)
labels = kmeans.fit_predict(data)
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.show()
np.random.seed(42)
【Code Answer】
import numpy as np
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(data)
plt.figure(figsize=(8,6))
plt.xlabel('Electronics Purchases')
plt.ylabel('Groceries Purchases')
plt.colorbar()
plt.show()
X = np.c_[X_price.ravel(), X_advertising.ravel()]
【Code Answer】
import numpy as np
X = np.c_[X_price.ravel(), X_advertising.ravel()]
y = 50 + 0.5 * X[:, 0] - 0.02 * X[:, 1] + 0.001 * X[:, 0] * X[:, 1] +
np.random.randn(*X[:, 0].shape) * 5
## Train the Decision Tree Regressor
model = DecisionTreeRegressor()
model.fit(X, y)
## Predict sales
y_pred = model.predict(X)
y_pred = y_pred.reshape(X_price.shape)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel('Price')
ax.set_ylabel('Advertising Budget')
ax.set_zlabel('Predicted Sales')
plt.show()
In this exercise, the goal is to understand how two features, "price" and "advertising budget," interact to affect sales predictions. A decision tree regressor is used to predict sales based on these features, meaning that the algorithm learns from the training data to capture non-linear relationships between the features. We start by generating a dataset that simulates possible combinations of price and advertising budget values. The simulated dataset also includes some random noise to reflect realistic data. The feature matrix X contains two features, price and advertising budget, which we simulate using numpy.meshgrid. The target variable y is generated from a simple mathematical formula that introduces a small interaction between the two features, simulating how real-world data might behave. After generating the dataset, we use a DecisionTreeRegressor, which is a type of model that can capture complex interactions and non-linear relationships between features. The model is trained using the fit() method, which learns the mapping from the input features to the target variable (sales). Once the model is trained, predictions are made over the input features using the predict() method. To visualize the interaction between price and advertising budget, we use a 3D surface plot. This plot shows how predicted sales change as both the price and the advertising budget vary. The x-axis represents price, the y-axis represents advertising budget, and the z-axis (the height of the surface) represents predicted sales. This visualization helps in understanding the model's prediction behavior in a more intuitive way, particularly in terms of feature interactions. The model captures how the interaction between the two features influences the outcome, making it easier to analyze the combined effect of price and advertising budget on predicted sales.
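A sketch of the surface-plot workflow; the grid ranges for price and advertising budget are assumptions, while the target formula mirrors the one in the extracted code.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D   # registers the 3D projection on older Matplotlib versions
from sklearn.tree import DecisionTreeRegressor

np.random.seed(0)
price = np.linspace(10, 100, 30)                 # assumed price range
advertising = np.linspace(100, 1000, 30)         # assumed advertising budget range
X_price, X_advertising = np.meshgrid(price, advertising)
X = np.c_[X_price.ravel(), X_advertising.ravel()]
y = 50 + 0.5 * X[:, 0] - 0.02 * X[:, 1] + 0.001 * X[:, 0] * X[:, 1] + np.random.randn(X.shape[0]) * 5

model = DecisionTreeRegressor()
model.fit(X, y)
y_pred = model.predict(X).reshape(X_price.shape)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X_price, X_advertising, y_pred, cmap='viridis')
ax.set_xlabel('Price')
ax.set_ylabel('Advertising Budget')
ax.set_zlabel('Predicted Sales')
plt.show()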
【Trivia】
Did you know that the same tree-based approach behind decision tree regressors is also used for classification? Decision trees work by splitting the data into smaller subsets based on feature values, creating a tree structure where each node represents a decision rule based on the features. This makes them flexible and interpretable models, although they can sometimes overfit if not properly tuned.
25. Visualizing the Impact of Regularization Techniques on Overfitting in a Machine Learning Model
Importance★★★★☆
Difficulty★★★☆☆
A company is developing a machine learning model to predict house prices based on multiple features (e.g., size, number of rooms, location, etc.). They are experiencing overfitting issues with their model, meaning it performs well on the training data but poorly on unseen data. Your task is to help the company understand how regularization techniques (L1, L2) can improve generalization and reduce overfitting. Create a regression model using synthetic data and visualize the impact of L1 (Lasso) and L2 (Ridge) regularization on the coefficients of the features. Your output should include a graph showing the effect of regularization on the learned coefficients as the strength of regularization increases.
【Data Generation Code Example】
import numpy as np
【Code Answer】
import numpy as np
lasso_coefs = np.array(lasso_coefs)
ridge_coefs = np.array(ridge_coefs)
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
for i in range(X.shape[1]):
plt.xscale('log')
plt.ylabel('Coefficients')
plt.legend()
plt.subplot(1, 2, 2)
for i in range(X.shape[1]):
plt.xscale('log')
plt.ylabel('Coefficients')
plt.title('Ridge Regularization (L2)')
plt.legend()
plt.tight_layout()
plt.show()
X, y = make_classification(n_samples=300, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1,
random_state=42)
【Diagram Answer】
【Code Answer】
import numpy as np
X, y = make_classification(n_samples=300, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1,
random_state=42)
clf_bag.fit(X, y)
clf_boost.fit(X, y)
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
ax.set_title(title)
plt.show()
【Code Answer】
import numpy as np
n_classes = y.shape[1]
classifier = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
plt.figure()
for i in range(n_classes):
# # Plot settings
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc='lower right')
plt.show()
【Code Answer】
import numpy as np
embeddings_scaled = scaler.fit_transform(embeddings)
#### Apply PCA to reduce to 2 dimensions
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings_scaled)
plt.figure(figsize=(10, 7))
plt.grid(True)
plt.show()
【Code Answer】
import numpy as np
import tensorflow as tf
y = demand[5:30]
Dense(1)])
plt.figure()
plt.xlabel('Day')
plt.ylabel('Demand')
plt.legend()
plt.title('RNN Prediction of Daily Demand')
plt.show()
【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
model = Sequential([
])
model.compile(optimizer='adam', loss='binary_crossentropy')
return model
# Train models with different activation functions and plot the outputs
outputs = []
model = create_model(act)
plt.figure(figsize=(10, 6))
plt.ylabel('Output Value')
plt.legend()
plt.show()
import pandas as pd
np.random.seed(42)
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
X = data['Month'].values.reshape(-1, 1)
y = data['Sales'].values
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
plt.xlabel('Month')
plt.ylabel('Sales')
plt.legend()
plt.show()
In this task, we are using a linear regression model to forecast future sales
based on historical sales data.
We begin by generating synthetic monthly sales data over a period of 36
months.
The sales data is created with a small linear growth factor, plus some
random noise to make it more realistic.
Next, we prepare the dataset for the machine learning model.
The month numbers are used as the input (X), and the sales values are used
as the target (y).
To build the forecasting model, we use the LinearRegression class from the
sklearn library.
After fitting the model to the data, we use it to predict the sales for each
month in the dataset.
Finally, we visualize both the actual sales data and the predicted sales data
on the same graph.
This allows us to evaluate how well the model fits the data and how
accurate the forecasts are.
The graph helps in comparing the trend and any deviations between the
predicted and actual values.
This exercise demonstrates how linear regression can be used to forecast
time series data.
In real-world applications, more complex models might be required for
datasets with seasonality or non-linear trends.
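Because the code answer above is truncated, here is a minimal end-to-end sketch of
the forecasting workflow described in this explanation; the growth rate and noise
level of the synthetic monthly sales are illustrative assumptions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: 36 months of sales with a linear growth trend plus noise
np.random.seed(42)
months = np.arange(1, 37)
sales = 200 + 5 * months + np.random.normal(0, 10, 36)
data = pd.DataFrame({'Month': months, 'Sales': sales})

X = data['Month'].values.reshape(-1, 1)
y = data['Sales'].values

model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Plot actual vs. predicted sales on the same graph
plt.plot(data['Month'], y, label='Actual Sales')
plt.plot(data['Month'], y_pred, linestyle='--', label='Predicted Sales')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.legend()
plt.show()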
【Trivia】
Linear regression is one of the simplest forecasting models, but it is often
not sufficient for capturing complex patterns in time series data. Advanced
methods such as ARIMA, Prophet, or LSTM (neural networks) are
frequently used for more accurate forecasting in business applications.
32. Visualizing the Impact of Dropout in a Neural
Network for Customer Churn Prediction
Importance★★★★☆
Difficulty★★★☆☆
A telecommunications company is facing a high customer churn rate, where
customers are canceling their services at an increasing rate. You are tasked
with building a simple neural network model to predict customer churn
using synthetic data. The focus is to visualize how different dropout rates
affect the model's performance and its ability to generalize. Please create a
neural network using Keras with at least two hidden layers, and use dropout
as a regularization technique. Compare the performance of the model with
and without dropout and plot the training and validation losses for both
cases. Generate synthetic data to simulate customer features, with 10 input
features and a binary target variable indicating churn (1 for churn, 0 for no
churn).
【Data Generation Code Example】
import numpy as np
np.random.seed(42)
【Code Answer】
#import necessary libraries
import numpy as np
np.random.seed(42)
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, 1000)
def create_model(dropout_rate):
model = Sequential()
model.add(Dense(64, input_dim=10, activation='relu'))
model.add(Dropout(dropout_rate))
model.add(Dense(32, activation='relu'))
model.add(Dropout(dropout_rate))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer=Adam(), loss='binary_crossentropy',
metrics=['accuracy'])
return model
model_dropout = create_model(0.5)
model_no_dropout = create_model(0.0)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
np.random.seed(42)
【Code Answer】
import numpy as np
np.random.seed(42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
plt.subplot(1, 2, 1)
plt.hist(residuals, bins=30, edgecolor='black', alpha=0.7)
plt.title('Histogram of Residuals')
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.subplot(1, 2, 2)
plt.tight_layout()
plt.show()
【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'age': np.random.normal(loc=40, scale=12, size=100),         # Age feature (normally distributed)
    'income': np.random.lognormal(mean=3, sigma=0.8, size=100),  # Income feature (log-normal distribution)
    'purchases': np.random.uniform(1, 50, 100)                   # Previous purchases (uniform distribution)
})

# Initialize scalers
standard_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()

# Apply StandardScaler and MinMaxScaler
data_standard_scaled = pd.DataFrame(standard_scaler.fit_transform(data), columns=data.columns)
data_minmax_scaled = pd.DataFrame(minmax_scaler.fit_transform(data), columns=data.columns)

# Plotting the original, StandardScaled, and MinMaxScaled distributions
fig, axs = plt.subplots(3, 3, figsize=(15, 12))
for i, column in enumerate(data.columns):
    # Original feature distribution
    axs[0, i].hist(data[column], bins=15, color='blue', alpha=0.7)
    axs[0, i].set_title(f'Original {column}')
    # StandardScaler feature distribution (assumed completion of the truncated answer)
    axs[1, i].hist(data_standard_scaled[column], bins=15, color='green', alpha=0.7)
    axs[1, i].set_title(f'Standardized {column}')
    # MinMaxScaler feature distribution (assumed completion of the truncated answer)
    axs[2, i].hist(data_minmax_scaled[column], bins=15, color='red', alpha=0.7)
    axs[2, i].set_title(f'Min-Max Scaled {column}')
plt.tight_layout()
plt.show()
import pandas as pd
df = pd.DataFrame(data)
【Diagram Answer】
【Code Answer】
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
df = pd.DataFrame(data)
y = df['Customer Churn']
clf.fit(X_train, y_train)
plt.figure(figsize=(10, 8))
plt.show()
import pandas as pd
np.random.seed(0)
# Creating a DataFrame
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(0)
X = data[['MarketingExpenditure']]
y = data['Sales']
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
residuals = y - predictions
# Step 5: Create residual plot
plt.scatter(predictions, residuals)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
This exercise involves assessing the quality of a linear regression model by
plotting residuals. Linear regression assumes that the residuals (errors
between the observed and predicted values) are randomly distributed with a
constant variance and are normally distributed around zero.
Generating Data: A dataset is first generated where the sales depend on
marketing expenditure with some added noise (random variation), which
simulates real-world scenarios where data often includes unpredictable
variations.
Fitting the Model: We fit a linear regression model using the
LinearRegression class from the sklearn library, where the independent
variable is the marketing expenditure and the dependent variable is the sales.
Residuals Calculation: Once the model has predicted the sales values, we
calculate the residuals by subtracting the predicted sales values from the
actual sales values. These residuals are essential in determining the model's
performance.
Residual Plot: The residual plot is generated by plotting the predicted values
on the x-axis and the residuals on the y-axis. The horizontal red line
indicates zero residuals, and ideally, residuals should be scattered randomly
around this line without forming any patterns.
If the residuals form a clear pattern, it suggests that the linear regression
model may not be appropriate for this data, indicating that other types of
models or transformations might be needed to improve the predictions.
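The code answer above omits the data-generation step and the zero-residual
reference line, so the following is a compact sketch of the full diagnostic,
assuming an illustrative linear relationship between marketing expenditure and sales.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: sales depend on marketing expenditure plus noise
np.random.seed(0)
marketing = np.random.uniform(10, 100, 200)
sales = 50 + 3 * marketing + np.random.normal(0, 20, 200)
data = pd.DataFrame({'MarketingExpenditure': marketing, 'Sales': sales})

X = data[['MarketingExpenditure']]
y = data['Sales']
model = LinearRegression().fit(X, y)
predictions = model.predict(X)
residuals = y - predictions

# Residual plot: predicted values on the x-axis, residuals on the y-axis
plt.scatter(predictions, residuals, alpha=0.6)
plt.axhline(y=0, color='red', linestyle='--')  # zero-residual reference line
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()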
【Trivia】
Residual plots are one of the most important diagnostic tools for evaluating
regression models. If residuals display any pattern, such as a curve or
funnel shape, it suggests that the model might be missing key variables or
interactions. In such cases, nonlinear regression models or transformations
might be necessary to capture the true relationship.
37. PCA and Biplots for Customer Data Analysis
Importance★★★★☆
Difficulty★★★☆☆
Imagine you are a data scientist at a retail company. The company collects
data on customer behavior, including their total spending in various product
categories. Your task is to analyze customer spending patterns to identify
underlying factors or trends that explain their behavior. You are asked to
perform a Principal Component Analysis (PCA) on customer data and
visualize the results using a biplot. Create a dataset that simulates spending
behavior across different product categories, such as 'Food', 'Electronics',
'Clothing', and 'Furniture'. Perform PCA to reduce the dimensionality of the
dataset and visualize the results with a biplot. The company wants to know
if customer behaviors can be grouped or explained by fewer components.
【Data Generation Code Example】
import numpy as np
np.random.seed(42)
【Code Answer】
# Importing necessary libraries
import numpy as np
import pandas as pd
customers = 100
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data_scaled)
plt.figure(figsize=(8,6))
plt.scatter(score[:,0], score[:,1], color='blue', s=50) # Plot the
principal components
for i in range(len(coeff)):
plt.arrow(0, 0, coeff[i,0]*max(score[:,0]),
coeff[i,1]*max(score[:,1]),
if labels is None:
plt.text(coeff[i,0]*max(score[:,0])*1.2,
coeff[i,1]*max(score[:,1])*1.2,
plt.text(coeff[i,0]*max(score[:,0])*1.2,
coeff[i,1]*max(score[:,1])*1.2,
plt.grid(True)
plt.show()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
【Diagram Answer】
【Code Answer】
import numpy as np
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = Sequential([
Dense(16, activation='relu'),
Dense(1, activation='sigmoid')
])
# Train the model and capture the history of the training process
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
# Create epochs range for the x-axis
plt.figure(figsize=(12, 6))
# Accuracy plot
plt.subplot(1, 2, 1)
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')
# Loss plot
plt.subplot(1, 2, 2)
plt.legend(loc='upper right')
plt.show()
In this task, you are training a neural network model to predict customer
churn using artificial data generated within the code. The neural network is
created using TensorFlow's Sequential model, which allows layers to be
stacked one after another. The model has an input layer with 10 features,
followed by two hidden layers with 32 and 16 neurons, respectively. The
final output layer has a single neuron with a sigmoid activation function,
which is suitable for binary classification (predicting whether a customer
will churn or not). The model is compiled with the Adam optimizer and the
binary cross-entropy loss function. The training history, including accuracy
and loss over 30 epochs, is stored and then visualized using matplotlib. The
training process runs for 30 epochs, and both training and validation data
are monitored. The accuracy and loss metrics are extracted from the
training history to visualize the performance of the model. Two line plots
are created: one for accuracy and another for loss, for both training and
validation sets. These plots are essential for understanding whether the
model is learning effectively or overfitting. In the code, matplotlib is used to
generate the plots. The epochs (1 through 30) are used on the x-axis, while
accuracy and loss values are plotted on the y-axis. By observing the graphs,
you can assess how well the model is performing on both the training and
validation datasets. If validation loss increases while validation accuracy
decreases, this indicates overfitting: the model performs well on the
training data but poorly on unseen data.
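Since only fragments of the code answer remain, here is a minimal sketch of the
training-and-plotting workflow described above; the random synthetic data and the
20% validation split are assumptions made for illustration.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# Hypothetical synthetic churn data: 10 features, binary target
np.random.seed(42)
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, 1000)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train for 30 epochs; 20% of the data is held out for validation (assumed split)
history = model.fit(X, y, epochs=30, validation_split=0.2, verbose=0)

epochs = range(1, 31)
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(epochs, history.history['accuracy'], label='Training Accuracy')
plt.plot(epochs, history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')
plt.subplot(1, 2, 2)
plt.plot(epochs, history.history['loss'], label='Training Loss')
plt.plot(epochs, history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()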
【Trivia】
The early stopping technique can be applied during model training to
prevent overfitting by stopping the training once the model performance
stops improving on the validation data.
39. Building and Visualizing a Multi-Output
Regression Model for Predicting House Prices and
Rental Rates
Importance★★★★☆
Difficulty★★★☆☆
You are working for a real estate company that wants to predict both house
prices and rental rates based on several features. Your task is to create a
multi-output regression model to predict both target variables using
machine learning. The features include the size of the house (in square
meters), the number of bedrooms, and the distance to the city center (in
kilometers). Please train the model, visualize the predicted vs. actual values
for both outputs, and display the two graphs in one plot.
【Data Generation Code Example】
import numpy as np
np.random.seed(42)
y = np.column_stack([y1, y2])
【Diagram Answer】
【Code Answer】
import numpy as np
#Generate data
np.random.seed(42)
y = np.column_stack([y1, y2])
model =
MultiOutputRegressor(RandomForestRegressor(n_estimators=100,
random_state=42))
model.fit(X_train, y_train)
#Make predictions
y_pred = model.predict(X_test)
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.subplot(1, 2, 2)
plt.tight_layout()
plt.show()
import pandas as pd
【Code Answer】
import numpy as np
import pandas as pd
y = data['Sales']
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)
dt_model = DecisionTreeRegressor()
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)
#Create a plot comparing the predicted sales from both models to the
actual sales
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_lr, color='blue', label='Linear Regression',
alpha=0.6)
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.legend()
plt.show()
np.random.seed(42)
for loc, scale in [([50, 200], 15), ([100, 500], 30), ([200, 700],
20)]])
【Diagram Answer】
【Code Answer】
import numpy as np
np.random.seed(42)
kmeans = KMeans(n_clusters=3)
kmeans_labels = kmeans.fit_predict(X)
dbscan_labels = dbscan.fit_predict(X)
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.title('K-Means Clustering')
plt.xlabel('Total Purchases')
plt.ylabel('Number of Transactions')
plt.subplot(1, 2, 2)
plt.xlabel('Total Purchases')
plt.ylabel('Number of Transactions')
plt.tight_layout()
plt.show()
X, y = make_classification(n_samples=1000, n_features=20,
n_informative=2, n_redundant=10, random_state=42)
【Diagram Answer】
【Code Answer】
import numpy as np
X, y = make_classification(n_samples=1000, n_features=20,
n_informative=2, n_redundant=10, random_state=42)
model = LogisticRegression()
model.fit(X, y)
plt.xlabel("Decision Threshold")
plt.ylabel("Rate")
plt.legend(loc="best")
plt.grid(True)
plt.show()
import pandas as pd
df = pd.DataFrame(X, columns=feature_names)
df['Sales'] = y
【Diagram Answer】
【Code Answer】
import numpy as np
import pandas as pd
model.fit(X, y)
importances = model.feature_importances_
plt.barh(feature_names, importances)
plt.xlabel('Importance')
plt.ylabel('Features')
plt.show()
In this exercise, we are using a decision tree regressor to predict sales based
on five product-related features. First, we generate the data using
make_regression(), which simulates regression data. The resulting dataset
has 100 samples and 5 features, which we label Feature_A to Feature_E.
The decision tree algorithm is well-suited for identifying important features
in datasets because it makes splits based on the features that reduce
prediction error the most. Once the model is trained, we can extract the
feature importances. The feature_importances_ attribute of the trained
decision tree gives us a measure of each feature's contribution to the
predictions. These importances are then visualized using a horizontal bar
chart, which allows us to see which features are most influential in
predicting sales. Visualizing feature importances helps decision-makers
understand which factors drive sales, enabling them to focus on optimizing
those aspects. In this case, the bar chart provides a clear visual guide
showing which features are more significant in influencing sales, and this
can guide future business decisions regarding product development, pricing
strategies, or marketing efforts.
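A compact sketch of the workflow described above, using make_regression with
100 samples and 5 features as stated; the noise level and random seed are
illustrative assumptions.
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Regression data: 100 samples, 5 features, labelled Feature_A .. Feature_E
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)
feature_names = ['Feature_A', 'Feature_B', 'Feature_C', 'Feature_D', 'Feature_E']

model = DecisionTreeRegressor(random_state=42)
model.fit(X, y)

# Horizontal bar chart of the learned feature importances
importances = model.feature_importances_
plt.barh(feature_names, importances)
plt.xlabel('Importance')
plt.ylabel('Features')
plt.show()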
【Trivia】
Did you know that decision trees are considered white-box models because
their decision-making process is easy to interpret? This is in contrast to
models like neural networks, which are often referred to as black-box
models because their predictions are much harder to explain!
44. Visualizing Confusion Matrices for Multi-Class
Classification Problems
Importance★★★★★
Difficulty★★★☆☆
A client is using a machine learning model to classify images of fruits into
three categories: apples, bananas, and oranges. The client is not only
interested in evaluating the accuracy of the model but also wants to
understand which categories are often confused with each other. Your task is
to simulate a simple multi-class classification problem, train a model, and
then generate a confusion matrix to visualize these misclassifications. You
need to create sample data, train a model, and output a confusion matrix to
help the client understand the performance of their model. Use the following
steps: generate synthetic data with three classes (apples, bananas, oranges);
train a simple machine learning classifier (e.g., Logistic Regression); display
the confusion matrix for the model's predictions.
【Data Generation Code Example】
import numpy as np
【Code Answer】
import numpy as np
model = LogisticRegression()
model.fit(X_train, y_train)
cm = confusion_matrix(y_test, y_pred)
# Plot the confusion matrix
plt.figure(figsize=(6,6))
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()
This exercise is centered around visualizing a confusion matrix, which is a
powerful tool for evaluating multi-class classification models. We first
generate synthetic data that mimics a classification problem with three
classes (apples, bananas, and oranges) using the make_classification
function. The data is split into training and testing sets to evaluate the
performance of the model on unseen data. We use a logistic regression
model, which is suitable for simple classification tasks and easy to train.
Once the model is trained, it is used to make predictions on the test set.
The confusion matrix is generated using the confusion_matrix function
from sklearn.metrics. The confusion matrix provides insight into how often
each class (apple, banana, or orange) is correctly or incorrectly classified.
Each row of the matrix represents the true label, and each column
represents the predicted label. Finally, we use the seaborn library's heatmap
functionality to visually display the confusion matrix. The heatmap
highlights where misclassifications occur, allowing for easy identification
of which categories are most frequently confused. For example, if bananas
are often classified as oranges, the corresponding cell in the matrix will
have a higher value. The x-axis shows the predicted labels, and the y-axis
shows the true labels, making the matrix easy to interpret. The confusion
matrix helps you understand the model's weaknesses, guiding further
improvements in model training, feature engineering, or data collection.
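Because the code answer above is fragmentary, here is a minimal sketch of the full
pipeline described in this explanation; the make_classification parameters and the
fruit label names attached to the heatmap axes are illustrative assumptions.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Three-class problem standing in for apples / bananas / oranges
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_test, y_pred)
labels = ['Apple', 'Banana', 'Orange']
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()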
【Trivia】
The confusion matrix was originally developed for binary classification but
has been widely extended to multi-class classification tasks. Its name comes
from its ability to capture where a model is "confused" about the true labels,
providing deeper insights than simple accuracy scores.
45. Visualizing the Distribution of Customer
Product Preferences
Importance★★★★☆
Difficulty★★☆☆☆
A retail company wants to analyze customer preferences across different
product categories. Your task is to visualize the distribution of customers
who prefer each product category. You need to generate synthetic data that
simulates customer preferences for three product categories: "Electronics",
"Clothing", and "Home and Garden". The company is looking for an
effective way to understand the percentage of customers interested in each
category to optimize its marketing strategy. Please generate a bar plot
showing the distribution of these preferences.
【Data Generation Code Example】
import numpy as np
import pandas as pd
df = pd.DataFrame({'Category': categories})
【Diagram Answer】
【Code Answer】
import numpy as np
import pandas as pd
customers = 1000
df = pd.DataFrame({'Category': categories})
category_counts = df['Category'].value_counts()
plt.figure(figsize=(8,6))
plt.xlabel('Product Category')
plt.ylabel('Number of Customers')
plt.show()
import pandas as pd
np.random.seed(42)
data = pd.DataFrame({
})
【Diagram Answer】
【Code Answer】
import numpy as np
import pandas as pd
data = pd.DataFrame({
scaler = MinMaxScaler()
X = data_scaled
y = data['product_price'] * data['units_sold']
model = LinearRegression()
model.fit(X, y)
plt.figure(figsize=(10, 5))
plt.title('Original Data')
plt.xlabel('Product Price')
plt.ylabel('Units Sold')
plt.subplot(1, 2, 2)
plt.title('Scaled Data')
plt.xlabel('Product Price (Scaled)')
plt.tight_layout()
plt.show()
import pandas as pd
【Code Answer】
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
rf = RandomForestClassifier()
lr.fit(X_train, y_train)
rf.fit(X_train, y_train)
## Make predictions
y_pred_lr = lr.predict(X_test)
y_pred_rf = rf.predict(X_test)
metrics = {
metrics_df.plot(kind='bar', figsize=(10,6))
plt.ylabel('Score')
plt.xticks(rotation=0)
plt.legend(loc='lower right')
plt.show()
【Code Answer】
import numpy as np
mse_values = []
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse_values.append(mse)
plt.figure()
plt.grid(True)
plt.show()
【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
age = np.random.randint(18, 70, 1000)
income = np.random.randint(20000, 120000, 1000)
prev_purchases = np.random.randint(0, 20, 1000)
purchase = np.random.choice([0, 1], 1000)
data = pd.DataFrame({'Age': age, 'Income': income, 'PreviousPurchases': prev_purchases, 'Purchased': purchase})

X = data[['Age', 'Income', 'PreviousPurchases']]
y = data['Purchased']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf_rf = RandomForestClassifier()
clf_rf.fit(X_train_scaled, y_train)
clf_gb = GradientBoostingClassifier()
clf_gb.fit(X_train_scaled, y_train)

importances_rf = clf_rf.feature_importances_
importances_gb = clf_gb.feature_importances_
features = X.columns

plt.figure(figsize=(10, 5))
plt.barh(features, importances_rf, color='blue', alpha=0.5, label='Random Forest')
plt.barh(features, importances_gb, color='red', alpha=0.5, label='Gradient Boosting')
plt.xlabel('Importance')
plt.title('Feature Importance: Random Forest vs Gradient Boosting')
plt.legend()
plt.show()
【Code Answer】
import numpy as np
X, y = make_classification(n_samples=200, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1,
random_state=42)
## Fit logistic regression model
clf = LogisticRegression()
clf.fit(X, y)
plt.figure(figsize=(8,6))
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
import pandas as pd
np.random.seed(42)
【Code Answer】
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
np.random.seed(42)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
In this problem, the goal is to visualize the performance of two time series
forecasting models against actual data.
First, synthetic data is generated to represent 100 days of sales data, along
with the predicted values from two models (Linear Regression and
ARIMA).
The actual sales data is simulated using a random normal distribution that is
cumulatively summed to create a trend-like behavior.
For each model, we added noise to the actual data to simulate the
predictions, with different levels of error for the linear regression and
ARIMA models.
In the visualization, Matplotlib is used to plot the time series data, with each
line representing either the actual sales or predictions from the two models.
The actual sales data is plotted with a solid line, while predictions are
plotted with dashed or dotted lines to distinguish them visually.
Additional plot elements include the title, axis labels, a legend to identify
each line, and gridlines for better readability.
By examining the plot, we can easily compare the accuracy of each model
over time and identify periods where the models may have diverged from
actual sales trends.
【Trivia】
Time series models are essential in business forecasting.
A common challenge is selecting the right model, as different models
handle trend, seasonality, and noise differently.
52. Visualizing Text Classification Results Using a
Confusion Matrix
Importance★★★★☆
Difficulty★★★☆☆
A company is facing challenges in classifying customer feedback as either
"positive" or "negative." The company has implemented a basic text
classification model to automatically label the feedback. However,
management would like to visualize how well the model is performing.
Create a Python program that trains a simple text classification model using
a sample dataset and visualizes the results using a confusion matrix. Your
task is to create the dataset, train a logistic regression model, and generate a
confusion matrix plot to show the classification performance. Use the
confusion matrix to explain how often the model correctly classifies
feedback as positive or negative.
【Data Generation Code Example】
from sklearn.model_selection import train_test_split
【Code Answer】
from sklearn.linear_model import LogisticRegression
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)
model = LogisticRegression()
model.fit(X_train_vect, y_train)
## Predicting on the test set
y_pred = model.predict(X_test_vect)
cm = confusion_matrix(y_test, y_pred)
disp.plot(cmap=plt.cm.Blues)
np.random.seed(42)
【Code Answer】
import numpy as np
np.random.seed(42)
accuracies = []
X, y = samples(size), labels(size)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracies.append(accuracy_score(y_test, y_pred))
plt.xlabel('Sample Size')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()
In this task, we aim to understand the relationship between sample size and
model accuracy using a machine learning classifier, Logistic Regression.
We begin by generating a synthetic dataset using random integers. This
dataset simulates customer interactions, where each data point contains
three features: ‣ 'time spent on website' ‣ 'items viewed' ‣ 'cart additions'.
The target label, 'purchase made', is a binary classification (0 or 1). The
dataset size varies from 100 to 10,000, allowing us to analyze how
increasing data impacts model accuracy. We use the train_test_split
function to divide the data into training and test sets, ensuring that the
model can generalize well on unseen data. A Logistic Regression model is
then trained on the training data, and predictions are made on the test data.
The accuracy of the model's predictions is calculated using accuracy_score.
Finally, the results are plotted, showing how the model's accuracy changes
as the sample size increases. This visualization demonstrates that a larger
dataset typically improves the model's performance by reducing variance
and helping it generalize better.
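A minimal sketch of the experiment described above. Note that the label rule below
deliberately ties 'purchase made' to the features (rather than being purely random)
so that accuracy can actually improve with sample size; that rule, the feature
ranges, and the list of sample sizes are assumptions for illustration.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

np.random.seed(42)
sample_sizes = [100, 500, 1000, 5000, 10000]
accuracies = []

for size in sample_sizes:
    # Hypothetical customer-interaction features: time on site, items viewed, cart additions
    X = np.column_stack([
        np.random.randint(1, 60, size),   # time spent on website (minutes)
        np.random.randint(1, 30, size),   # items viewed
        np.random.randint(0, 10, size),   # cart additions
    ])
    # Assumed label rule: purchase depends loosely on the features plus noise
    y = ((0.05 * X[:, 0] + 0.1 * X[:, 1] + 0.5 * X[:, 2] + np.random.normal(0, 1, size)) > 5).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, model.predict(X_test)))

plt.plot(sample_sizes, accuracies, marker='o')
plt.xlabel('Sample Size')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()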
【Trivia】
The relationship between sample size and accuracy is often referred to as
the "bias-variance tradeoff." Small sample sizes can lead to overfitting,
where the model learns noise from the data instead of meaningful patterns,
leading to high variance and lower accuracy on new data. Larger datasets
reduce this risk, but they also increase the computational resources required
for training.
54. Visualizing Learning Curves for Machine
Learning Models
Importance★★★★★
Difficulty★★★☆☆
A retail company is analyzing customer purchasing patterns to improve its
recommendation system. You are tasked with comparing the performance of
two machine learning models: Logistic Regression and Random Forest. The
company wants to understand how the models' performance improves as the
amount of training data increases. Using Python, create synthetic data that
simulates customer purchases (binary classification). Then, train both
models and plot their learning curves. Ensure that the plot shows both
training and validation scores as a function of the training set size.
【Data Generation Code Example】
import numpy as np
X, y = make_classification(n_samples=1000, n_features=20,
n_informative=2, n_classes=2, random_state=42)
【Code Answer】
import numpy as np
X, y = make_classification(n_samples=1000, n_features=20,
n_informative=2, n_classes=2, random_state=42)
logistic_model = LogisticRegression()
random_forest_model = RandomForestClassifier()
plt.ylabel('Accuracy')
plt.title(title)
plt.legend()
plt.grid(True)
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.subplot(1, 2, 2)
plot_learning_curve(random_forest_model, X_train, y_train, 'Random
Forest Learning Curve')
plt.tight_layout()
plt.show()
np.random.seed(0)
y = purchase
【Diagram Answer】
【Code Answer】
import numpy as np
np.random.seed(0)
y = purchase
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LogisticRegression()
model.fit(X_scaled, y)
age_fixed = 35
web_activity_fixed = 20
# Prepare the input data with varying income while keeping other features
constant
X_test_scaled = scaler.transform(X_test)
purchase_probabilities = model.predict_proba(X_test_scaled)[:, 1]
# Plot the results to visualize sensitivity to income changes
plt.xlabel("Income")
plt.legend()
plt.grid(True)
plt.show()
import pandas as pd
np.random.seed(42)
# Create DataFrame
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
y = data['default']
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred_raw = model.predict(X_test)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model_scaled = LogisticRegression()
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
conf_matrix_scaled = confusion_matrix(y_test, y_pred_scaled)
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
import tensorflow as tf
np.random.seed(42)
y = np.random.randint(0, 3, 300)
# Splitting data into training and testing sets
【Code Answer】
import numpy as np
import tensorflow as tf
np.random.seed(42)
model = models.Sequential([
layers.MaxPooling2D((2, 2)),
layers.MaxPooling2D((2, 2)),
layers.Flatten(),
layers.Dense(64, activation='relu'),
layers.Dense(3, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
【Code Answer】
import numpy as np
plt.xlabel("Day")
plt.ylabel("Sales Predictions")
plt.legend()
plt.grid(True)
plt.show()
In this problem, the task involves comparing the predictions from three
different machine learning models on a simulated dataset. The first step is
to create a dataset that represents the daily sales for 20 days. In reality,
machine learning models would be trained on past sales data, but here we
simulate their predictions using random integers. Three different sets of
predictions are created by generating random values: model1_predictions,
model2_predictions, and model3_predictions. The visualization part uses
matplotlib to create a line plot showing how each model's predictions differ
across the 20 days. The plot() function is used to display the predictions of
each model, and different markers ('o', 's', '^') are used to distinguish the
models. Labels for the x-axis (Day) and y-axis (Sales Predictions) are added
to clarify the plot, and a legend is created to identify the line for each
model. Finally, grid(True) is included to display a grid, making it easier to
interpret the values in the plot. This visualization allows for an easy
comparison of the aggregated predictions, helping to analyze whether the
models provide similar or divergent predictions over time.
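A minimal, self-contained sketch of the comparison described above; the prediction
value range (80 to 120) is an illustrative assumption.
import numpy as np
import matplotlib.pyplot as plt

# Simulated sales predictions from three models over 20 days
np.random.seed(42)
days = np.arange(1, 21)
model1_predictions = np.random.randint(80, 120, 20)
model2_predictions = np.random.randint(80, 120, 20)
model3_predictions = np.random.randint(80, 120, 20)

plt.plot(days, model1_predictions, marker='o', label='Model 1')
plt.plot(days, model2_predictions, marker='s', label='Model 2')
plt.plot(days, model3_predictions, marker='^', label='Model 3')
plt.xlabel("Day")
plt.ylabel("Sales Predictions")
plt.legend()
plt.grid(True)
plt.show()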
【Trivia】
Aggregating results from multiple machine learning models is known as
ensemble learning. One common technique, called "stacking," uses the
predictions from multiple models to create a more robust final prediction,
often improving overall performance.
59. Visualizing the Relationship Between Features
and a Target in a Customer Churn Problem
Importance★★★★☆
Difficulty★★★☆☆
A telecommunications company is trying to understand the factors that lead
to customer churn (when customers stop using their service). You are tasked
with building a machine learning model to predict whether a customer will
churn based on various features like customer age, monthly charges,
contract type, and tenure. Your task is to create a classification model and
visualize the relationship between customer tenure and the probability of
churn. Use a logistic regression model and visualize the relationship by
plotting the predicted probabilities of churn based on customer tenure.
Generate a dataset that contains at least the following features:
‣ 'tenure': The number of months the customer has been with the company
‣ 'monthly_charges': The amount the customer pays per month
‣ 'age': The customer's age
‣ 'churn': Whether the customer has churned or not (binary target)
Plot a graph showing the relationship between 'tenure' and the predicted
probability of churn.
【Data Generation Code Example】
import numpy as np
import pandas as pd

np.random.seed(42)
n = 1000
tenure = np.random.randint(1, 61, n)
monthly_charges = np.random.uniform(30, 100, n)
age = np.random.randint(18, 75, n)
churn = np.random.choice([0, 1], n, p=[0.7, 0.3])
data = pd.DataFrame({'tenure': tenure, 'monthly_charges': monthly_charges, 'age': age, 'churn': churn})
【Diagram Answer】
【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

np.random.seed(42)
n = 1000
tenure = np.random.randint(1, 61, n)
monthly_charges = np.random.uniform(30, 100, n)
age = np.random.randint(18, 75, n)
churn = np.random.choice([0, 1], n, p=[0.7, 0.3])
data = pd.DataFrame({'tenure': tenure, 'monthly_charges': monthly_charges, 'age': age, 'churn': churn})
# Split the data into features and target
X = data[['tenure', 'monthly_charges', 'age']]
y = data['churn']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize and train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Generate predicted probabilities for churn based on tenure
X_test_sorted = X_test.sort_values(by='tenure')
predicted_probs = model.predict_proba(X_test_sorted)[:, 1]
# Plot the relationship between tenure and predicted churn probability
plt.figure(figsize=(10, 6))
plt.plot(X_test_sorted['tenure'], predicted_probs, label='Predicted Probability of Churn')
plt.xlabel('Customer Tenure (Months)')
plt.ylabel('Predicted Probability of Churn')
plt.title('Churn Probability vs Customer Tenure')
plt.legend()
plt.grid(True)
plt.show()
【Code Answer】
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import tensorflow as tf
import matplotlib.pyplot as plt

## Create synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Define and compile a simple neural network model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

## Train the model while capturing history
history = model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test), verbose=0)

## Extract accuracy and loss from the training history
accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(accuracy) + 1)

## Plot accuracy and loss
fig, ax1 = plt.subplots()

## Plot accuracy
ax1.plot(epochs, accuracy, 'b-', label='Training Accuracy')
ax1.plot(epochs, val_accuracy, 'g-', label='Validation Accuracy')
ax1.set_xlabel('Epochs')
ax1.set_ylabel('Accuracy', color='b')
ax1.tick_params('y', colors='b')

## Create a twin y-axis for the loss
ax2 = ax1.twinx()
ax2.plot(epochs, loss, 'r-', label='Training Loss')
ax2.plot(epochs, val_loss, color='orange', label='Validation Loss')
ax2.set_ylabel('Loss', color='r')
ax2.tick_params('y', colors='r')

## Add legends and show the plot
fig.legend(loc="upper right", bbox_to_anchor=(1, 1), bbox_transform=ax1.transAxes)
plt.title('Training and Validation Accuracy and Loss Over Time')
plt.show()
import pandas as pd
np.random.seed(42)
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
model = LinearRegression()
X = df[['Advertising Spend']]
y = df['Sales']
model.fit(X, y)
y_pred = model.predict(X)
# Plot residuals
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted Sales')
plt.ylabel('Residuals')
plt.show()
np.random.seed(42)
【Code Answer】
import numpy as np
np.random.seed(42)
clusters = kmeans.fit_predict(customers)
plt.xlabel('Purchase Frequency')
plt.legend()
plt.show()
import pandas as pd
np.random.seed(42)
# Create a DataFrame
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
X = data['Day'].values.reshape(-1, 1)
y = data['Sales'].values
model = LinearRegression()
model.fit(X, y)
plt.figure(figsize=(10, 6))
plt.xlabel('Day')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.show()
The goal of this task is to apply machine learning techniques to predict
future values for a time series dataset, specifically sales data. The model
chosen is Linear Regression, which is well-suited for detecting linear
relationships between variables like time and sales. First, we generate
synthetic sales data using a random normal distribution to simulate a
realistic sales pattern over 90 days. We store these values in a Pandas
DataFrame and prepare the data for the machine learning model by
reshaping the 'Day' column as the independent variable (X) and the 'Sales'
column as the dependent variable (y). The Linear Regression model is
trained on the existing data, which helps the model identify patterns or
trends in sales growth over time. After training, we use the model to predict
sales for the next 30 days, which involves creating a new sequence of future
days and feeding it into the model. Finally, we visualize the results by
plotting both the actual sales and the predicted sales on a graph. The actual
sales are shown for the first 90 days, while the predicted values are plotted
as a dashed line for the next 30 days. This approach lets you clearly see how
well the model's predictions align with the actual sales trend. The gridlines,
labels, and legend make the graph easier to interpret.
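Since the code answer above is truncated, here is a compact sketch of the 90-day fit
plus 30-day forecast described in this explanation; the trend slope and noise level
of the synthetic sales are assumptions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic sales for 90 days: upward trend plus normally distributed noise
np.random.seed(42)
days = np.arange(1, 91)
sales = 200 + 2 * days + np.random.normal(0, 15, 90)
data = pd.DataFrame({'Day': days, 'Sales': sales})

X = data['Day'].values.reshape(-1, 1)
y = data['Sales'].values
model = LinearRegression()
model.fit(X, y)

# Predict the next 30 days
future_days = np.arange(91, 121).reshape(-1, 1)
future_pred = model.predict(future_days)

plt.figure(figsize=(10, 6))
plt.plot(days, y, label='Actual Sales')
plt.plot(future_days.ravel(), future_pred, linestyle='--', label='Predicted Sales (next 30 days)')
plt.xlabel('Day')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.show()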
【Trivia】
Linear Regression is one of the simplest and most interpretable machine
learning algorithms. However, for more complex time series data, other
algorithms such as ARIMA, LSTM, or Prophet may provide more accurate
predictions.
64. Visualizing Model Predictions for House Price
Prediction
Importance★★★★☆
Difficulty★★★☆☆
You are working as a data scientist for a real estate agency that wants to
help clients understand the relationship between various house features
(e.g., size, number of rooms, age) and the predicted house prices. Your task
is to build a simple linear regression model using synthetic data and
visualize the model's predictions in comparison to the actual data. The
agency needs a visualization that allows clients to see the model's prediction
for house prices based on square footage. You will need to create a scatter
plot of actual house prices against square footage and overlay it with a line
showing predicted house prices. Using Python, generate synthetic data for
house prices and square footage, fit a linear regression model, and create
the required visualization. The agency expects you to clearly distinguish
between actual and predicted values in the plot, and to ensure the plot is
easy for clients to interpret.
【Data Generation Code Example】
import numpy as np
import pandas as pd
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
X = df[['SquareFootage']]
y = df['HousePrice']
model = LinearRegression()
model.fit(X, y)
predicted_prices = model.predict(X)
plt.xlabel('Square Footage')
plt.legend()
plt.grid(True)
plt.show()
This problem focuses on helping you understand how to use machine
learning models, particularly linear regression, to solve real-world
problems. Linear regression is one of the most basic models used in
machine learning to predict continuous values, and it is important to know
how to visualize the performance of the model using plots. First, synthetic
data for house prices and square footage is generated. In this case, the
square footage is a feature (independent variable), and the house price is the
target variable (dependent variable). The relationship between the two is
modeled by a simple linear regression. Next, the linear regression model
from sklearn is used to fit the relationship between square footage and
house prices. The model uses this relationship to predict house prices based
on the square footage of houses. Once the model is trained, you can use it to
make predictions on the same input data and compare these predictions to
the actual house prices. The matplotlib library is used to create a scatter plot
that shows actual house prices (in blue) and the predicted house prices (as a
red line). The red line represents the model's prediction based on the square
footage. By plotting the actual vs. predicted prices, you can visually inspect
how well the model has captured the underlying trend in the data. The goal
is to overlay the predicted values in such a way that clients can easily
interpret whether the model's predictions are close to reality. This kind of
visualization is useful not only for explaining models to non-technical
audiences but also for assessing model accuracy during development.
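A minimal sketch of the visualization described above; the synthetic price equation
(base price, price per square meter, and noise) is an illustrative assumption.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: price grows roughly linearly with square footage
np.random.seed(42)
square_footage = np.random.uniform(50, 300, 200)
house_price = 50000 + 1200 * square_footage + np.random.normal(0, 20000, 200)
df = pd.DataFrame({'SquareFootage': square_footage, 'HousePrice': house_price})

X = df[['SquareFootage']]
y = df['HousePrice']
model = LinearRegression()
model.fit(X, y)
predicted_prices = model.predict(X)

# Actual prices as blue points, model predictions as a red line (sorted for a clean line)
order = np.argsort(df['SquareFootage'].values)
plt.scatter(df['SquareFootage'], y, color='blue', alpha=0.6, label='Actual Prices')
plt.plot(df['SquareFootage'].values[order], predicted_prices[order], color='red', label='Predicted Prices')
plt.xlabel('Square Footage')
plt.ylabel('House Price')
plt.legend()
plt.grid(True)
plt.show()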
【Trivia】
Linear regression is one of the simplest types of machine learning models
but is extremely powerful in many scenarios. Despite its simplicity, it forms
the foundation for many more complex models like polynomial regression
and logistic regression. In real estate, linear models are commonly used to
estimate prices based on features like square footage, location, and number
of rooms.
65. Visualizing Outliers with Boxplots in Machine
Learning Data
Importance★★★★★
Difficulty★★★☆☆
You are working with a retail company that tracks daily sales data to
identify trends and improve customer service. The company wants to find
potential outliers in the sales data that could indicate either technical issues
or special circumstances like promotions. Your task is to visualize the sales
data using a boxplot to detect any significant outliers. Generate synthetic
sales data for 100 days, with sales numbers randomly distributed between
$500 and $1500, but include a few extreme outliers between $2000 and
$3000 to simulate these unusual events. Visualize this data with a boxplot
and explain how the boxplot helps to identify outliers.
【Data Generation Code Example】
import numpy as np
import pandas as pd
np.random.seed(42)
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
plt.boxplot(sales_data)
plt.ylabel('Sales in USD')
plt.show()
import pandas as pd
np.random.seed(42)
data['contract_type'] = data['contract_type'].map({'month-to-month': 0,
'one-year': 1, 'two-year': 2})
【Diagram Answer】
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
})
data['contract_type'] = data['contract_type'].map({'month-to-month': 0,
'one-year': 1, 'two-year': 2})
X = data.drop('churn', axis=1)
y = data['churn']
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f'Accuracy: {accuracy}')
features = X.columns
plt.barh(features, feature_importances)
plt.xlabel('Importance')
grid_size = 5
【Code Answer】
import numpy as np
np.random.seed(42)
grid_size = 5
start = (0, 0)
goal = (4, 4)
episodes = 100
actions = [(0, -1), (0, 1), (-1, 0), (1, 0)] # Left, Right, Up, Down
return next_state
def simulate_episode(q_table):
state = start
path = [state]
path.append(state)
return path
q_table = generate_random_q_table()
paths_over_episodes = []
Visualization
maze = create_maze()
axs[idx].imshow(maze, cmap='gray_r')
axs[idx].set_title(f'Episode {episode}')
axs[idx].legend()
plt.show()
X, y = make_classification(n_samples=200, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1)
【Diagram Answer】
【Code Answer】
import numpy as np
X, y = make_classification(n_samples=200, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1)
h = .02
plt.figure(figsize=(15, 5))
clf.fit(X, y)
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.subplot(1, 3, i + 1)
plt.title(titles[i])
plt.tight_layout()
plt.show()
import pandas as pd
np.random.seed(0)
data = pd.DataFrame({
})
【Code Answer】
import numpy as np
import pandas as pd
data = pd.DataFrame({
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.title('Before Preprocessing')
plt.xlabel('Index')
plt.ylabel('Values')
plt.legend()
# Step 1: Handle missing values using median strategy
imputer = SimpleImputer(strategy='median')
data['Age'] = imputer.fit_transform(data[['Age']])
iqr = q3 - q1
scaler = StandardScaler()
plt.subplot(1, 2, 2)
plt.title('After Preprocessing')
plt.xlabel('Index')
plt.ylabel('Values (Scaled)')
plt.legend()
plt.tight_layout()
plt.show()
X, y = make_classification(n_samples=1000, n_features=10,
n_informative=5, n_classes=2, random_state=42)
【Diagram Answer】
【Code Answer】
import numpy as np
X, y = make_classification(n_samples=1000, n_features=10,
n_informative=5, n_classes=2, random_state=42)
clf = DecisionTreeClassifier(random_state=42)
grid_search.fit(X_train, y_train)
results = grid_search.cv_results_
depths = np.array(param_grid['max_depth'])
splits = np.array(param_grid['min_samples_split'])
plt.colorbar(label='Accuracy')
plt.xticks(np.arange(len(splits)), splits)
plt.yticks(np.arange(len(depths)), depths)
plt.xlabel('min_samples_split')
plt.ylabel('max_depth')
plt.show()
import pandas as pd
np.random.seed(42)
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
#Creating a DataFrame
data['anomaly'] = model.fit_predict(data[['value']])
plt.figure(figsize=(10, 6))
plt.xlabel('Date')
plt.ylabel('Sensor Value')
plt.legend()
plt.grid(True)
plt.show()
In this exercise, you are detecting anomalies in time series data using the
Isolation Forest algorithm. Time series anomalies are important in various
industries because they help detect issues such as equipment failure,
network problems, or fraudulent activities.
Synthetic Data Generation: The
synthetic time series data is created using numpy and pandas. The values
are mostly generated from a normal distribution (mean 100, standard
deviation 5), which simulates normal sensor behavior. Sudden drops and
spikes are manually added between indexes 50–55 and 150–155 to simulate
abnormal behavior.
Isolation Forest Algorithm: Isolation Forest is an unsupervised machine
learning algorithm designed for anomaly detection. It works by isolating
observations by randomly selecting a feature and splitting the data. The
more splits it takes to isolate a point, the less likely it is an anomaly. The
model is fitted using the fit_predict() method, which outputs -1 for
anomalies and 1 for normal data. We set the contamination level to 0.05,
meaning we expect about 5% of the data to be anomalies.
Visualization: The time series data is plotted using matplotlib, showing
normal sensor values and highlighting anomalies in red. This visualization
helps easily identify the points where the sensor deviated from normal
behavior. The plot includes a title, axis labels, and a legend for clarity.
This task teaches how to detect anomalies in time series data using an
unsupervised machine learning model. It also helps develop skills in
creating synthetic datasets and using visualization to interpret machine
learning results.
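A compact sketch of the anomaly-detection workflow described above, following the
stated parameters (mean 100, standard deviation 5, anomalies injected at indexes
50-55 and 150-155, contamination 0.05); the start date and the size of the injected
drop and spike are assumptions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Hypothetical sensor readings: normal behaviour around 100, with injected drops and spikes
np.random.seed(42)
values = np.random.normal(loc=100, scale=5, size=200)
values[50:56] -= 40    # sudden drop (assumed magnitude)
values[150:156] += 40  # sudden spike (assumed magnitude)
data = pd.DataFrame({'date': pd.date_range('2024-01-01', periods=200, freq='D'), 'value': values})

# Isolation Forest flags roughly 5% of points as anomalies (-1) and the rest as normal (1)
model = IsolationForest(contamination=0.05, random_state=42)
data['anomaly'] = model.fit_predict(data[['value']])

plt.figure(figsize=(10, 6))
plt.plot(data['date'], data['value'], label='Sensor Value')
anomalies = data[data['anomaly'] == -1]
plt.scatter(anomalies['date'], anomalies['value'], color='red', label='Anomaly')
plt.xlabel('Date')
plt.ylabel('Sensor Value')
plt.legend()
plt.grid(True)
plt.show()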
【Trivia】
Isolation Forest is a tree-based anomaly detection method that scales well
for high-dimensional data. It was specifically designed for anomaly
detection, unlike other machine learning models adapted for this task.
72. Visualizing the Voting Classifier for Customer
Churn Prediction
Importance★★★★★
Difficulty★★★☆☆
A telecommunications company is concerned about customer churn and
wants to predict whether a customer will leave the company based on
various factors like monthly charges, contract type, and payment method.
They’ve decided to use an ensemble learning model, combining different
classifiers using a Voting Classifier. You are tasked with building a Voting
Classifier with three models: Logistic Regression, Random Forest, and
Support Vector Machine (SVM). Use this ensemble model to predict
customer churn. Finally, visualize the decision boundary of the ensemble
model along with each individual classifier. Generate a synthetic dataset
with customer features: 'MonthlyCharges' and 'ContractType'. Implement a
Voting Classifier (with majority voting). Visualize the decision boundaries
of each individual classifier and the ensemble model.
【Data Generation Code Example】
import numpy as np
X, y = make_classification(n_samples=500, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1,
random_state=42)
【Code Answer】
import numpy as np
X, y = make_classification(n_samples=500, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1,
random_state=42)
clf1 = LogisticRegression()
clf2 = RandomForestClassifier(n_estimators=100)
clf3 = SVC(probability=True)
clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)
clf3.fit(X_train, y_train)
voting_clf.fit(X_train, y_train)
Z1 = clf1.predict(grid).reshape(xx.shape)
Z = clf2.predict(grid).reshape(xx.shape)
Z3 = clf3.predict(grid).reshape(xx.shape)
Z_voting = voting_clf.predict(grid).reshape(xx.shape)
plt.figure(figsize=(14, 10))
# Logistic Regression
plt.subplot(2, 2, 1)
plt.title('Logistic Regression')
# Random Forest
plt.subplot(2, 2, 2)
plt.title('Random Forest')
plt.subplot(2, 2, 3)
plt.title('SVM')
# Voting Classifier
plt.subplot(2, 2, 4)
plt.title('Voting Classifier')
plt.tight_layout()
plt.show()
In this task, you are asked to build a Voting Classifier that combines three
different models: Logistic Regression, Random Forest, and Support Vector
Machine (SVM). The Voting Classifier aggregates the predictions from
these individual classifiers and makes the final prediction based on a
majority vote, or in this case, a soft vote (based on the predicted
probabilities). This allows the strengths of different classifiers to
complement each other and improve prediction performance.
The first step is to generate synthetic data using make_classification. This
function creates a dataset with two features that can be used to separate two
classes. We split this dataset into training and testing sets to train the
models.
After that, we initialize each individual classifier. Logistic Regression is a
linear model that is easy to interpret, while Random Forest is an ensemble
model that uses decision trees. SVM is effective in high-dimensional
spaces, and in this case, we use it with probability estimates for soft voting.
Next, we combine these classifiers into a VotingClassifier. The voting='soft'
parameter means that the final prediction is based on the average of
predicted probabilities from the individual classifiers. This soft voting
approach is often more robust than hard voting, where only the final
predicted labels are used.
To visualize how each classifier makes decisions, we create a mesh grid
over the feature space. For each point in this grid, the classifiers predict
whether it belongs to class 0 or class 1. The results are then plotted to show
the decision boundaries. By comparing the plots, you can see how each
model’s decision-making differs and how the ensemble Voting Classifier
aggregates them.
Finally, the plt.contourf function is used to plot the decision regions, while
the plt.scatter function is used to plot the actual data points. Each subplot
represents the decision boundary of one classifier, allowing you to see how
they each divide the feature space differently and how the Voting Classifier
integrates them into a final decision boundary.
This exercise demonstrates the power of ensemble learning, especially
when different types of models are combined to create a more robust
predictor. The visual comparison of decision boundaries helps you better
understand how each model contributes to the ensemble.
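The code answer above omits the data split, the mesh grid, and the plotting calls,
so here is a minimal sketch that fills in those steps under stated assumptions; the
grid resolution and plot styling are illustrative choices, not part of the original
answer.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf1 = LogisticRegression()
clf2 = RandomForestClassifier(n_estimators=100, random_state=42)
clf3 = SVC(probability=True, random_state=42)
voting_clf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svm', clf3)], voting='soft')

# Mesh grid over the feature space for plotting decision regions
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

plt.figure(figsize=(14, 10))
models = [(clf1, 'Logistic Regression'), (clf2, 'Random Forest'),
          (clf3, 'SVM'), (voting_clf, 'Voting Classifier')]
for i, (model, title) in enumerate(models, start=1):
    model.fit(X_train, y_train)
    Z = model.predict(grid).reshape(xx.shape)
    plt.subplot(2, 2, i)
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', s=20)
    plt.title(title)
plt.tight_layout()
plt.show()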
【Trivia】
The Voting Classifier works particularly well when the individual classifiers
are diverse in their approach to learning the data. This diversity helps
ensure that errors made by one model are corrected by others, making the
ensemble model more resilient.
73. Visualizing the Impact of Train-Test Splits in
Machine Learning
Importance★★★★★
Difficulty★★★☆☆
You are a data scientist at a company that develops predictive models for
real estate prices. The company wants to understand how different train-test
splits affect the performance of a linear regression model predicting house
prices. Your task is to visualize the performance of the model using Mean
Squared Error (MSE) as a metric by splitting the data into different training
and testing proportions (80%-20%, 70%-30%, 60%-40%). Generate
synthetic data for housing prices using features like size,
number_of_bedrooms, and age_of_house. Build and evaluate linear
regression models for the different splits, and plot the MSE values for each
split to compare model performance.
【Data Generation Code Example】
import numpy as np
【Code Answer】
import numpy as np
mse_values = []
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse_values.append(mse)
plt.xlabel('Train-Test Split')
plt.show()
In this exercise, you are required to explore how changing the train-test split
ratio affects the performance of a machine learning model.
We use linear regression, a supervised learning algorithm, to predict house
prices based on features like size, the number of bedrooms, and the age of
the house.
First, we generate synthetic data where X contains the features (size,
number of bedrooms, and age of the house), and y contains the house prices
calculated based on a linear equation.
This synthetic data allows us to simulate a real-world scenario of predicting
house prices using basic features.
We then use different proportions for splitting the data into training and
testing sets: 80%-20%, 70%-30%, and 60%-40%.
The goal is to evaluate how the model's performance, measured by the
Mean Squared Error (MSE), changes when the amount of training data is
increased or decreased.
The MSE is used as the performance metric, which measures how well the
model's predictions match the actual values. A lower MSE indicates better
performance.
We train the model on the training set and predict prices on the test set for
each split. We then compute the MSE for each split and store these values.
Finally, we plot the MSE for each of the splits to visualize how model
performance changes with different train-test ratios.
This is a common task when building machine learning models, as the
quality of the training data has a significant impact on the model's accuracy
and generalization ability.
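A minimal sketch of the experiment described above; the coefficients of the
synthetic price equation and the noise level are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical housing data: price is a linear function of the features plus noise
np.random.seed(42)
size = np.random.uniform(50, 250, 500)
bedrooms = np.random.randint(1, 6, 500)
age = np.random.uniform(0, 50, 500)
X = np.column_stack([size, bedrooms, age])
y = 1000 * size + 20000 * bedrooms - 500 * age + np.random.normal(0, 20000, 500)

splits = [0.2, 0.3, 0.4]  # test-set proportions for the 80/20, 70/30 and 60/40 splits
mse_values = []
for test_size in splits:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse_values.append(mean_squared_error(y_test, y_pred))

plt.bar(['80%-20%', '70%-30%', '60%-40%'], mse_values)
plt.xlabel('Train-Test Split')
plt.ylabel('Mean Squared Error')
plt.show()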
【Trivia】
In real-world machine learning, different train-test splits can have a
significant impact on the model’s performance, especially with smaller
datasets.
While 80%-20% is a common default split, in some domains, such as
finance or medicine, even smaller splits like 90%-10% may be used to
maximize the use of training data.
74. Using Violin Plots to Visualize Categorical Data
in Customer Satisfaction
Importance★★★★☆
Difficulty★★★☆☆
A marketing team for an online retail company wants to analyze customer
satisfaction based on product category. The company has data on customer
satisfaction ratings (on a scale of 1 to 5) for several different product
categories (e.g., electronics, clothing, furniture). Your task is to visualize the
distribution of customer satisfaction ratings for each product category using
a violin plot. This will help the marketing team understand the spread of
ratings and identify categories that may need improvement. Generate a
random dataset with 3 product categories (Electronics, Clothing, Furniture)
and 200 customer ratings for each category. Use this data to create a violin
plot showing the distribution of ratings per category.
【Data Generation Code Example】
import numpy as np
import pandas as pd
np.random.seed(42)
【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
# Simulate 200 ratings (1-5) for each of the three product categories
categories = np.repeat(['Electronics', 'Clothing', 'Furniture'], 200)
ratings = np.random.randint(1, 6, size=600)
data = pd.DataFrame({'Category': categories, 'Rating': ratings})
plt.figure(figsize=(8, 6))
sns.violinplot(x='Category', y='Rating', data=data)
plt.xlabel('Product Category')
plt.ylabel('Customer Satisfaction Rating')
plt.show()
In this task, you are required to visualize customer satisfaction ratings using
a violin plot, which is a combination of a box plot and a kernel density plot.
It shows not only summary statistics such as the median and quartiles but
also the density of the data at various points, making it easier to understand
the spread and skewness of the data.
The first step is to generate the dataset. In this case, random integer values
between 1 and 5 (representing customer satisfaction ratings) are created
using np.random.randint. You generate 600 ratings in total, 200 for each of
the three product categories: Electronics, Clothing, and Furniture. This step
simulates a realistic scenario where customers provide ratings for various
products.
Once the data is prepared, you use the Seaborn library to create the violin
plot. Seaborn's violinplot function is ideal for showing the distribution of
numerical data across different categories. By setting the x-axis to Category
and the y-axis to Rating, you display how customer satisfaction is distributed
across the three product types. This is useful in machine learning when
exploring feature distributions or evaluating how different categories affect
a target variable.
Finally, plt.show() is used to display the plot. The plot gives you insights
into how satisfaction ratings are spread within each product category, which
is valuable for decision-making in customer satisfaction improvement
strategies.
【Trivia】
Violin plots are especially useful in machine learning for feature
engineering and exploratory data analysis (EDA). By visualizing the
distribution of categorical features, you can quickly identify imbalances,
outliers, or clusters that may influence the model’s predictions.
75. Visualizing Feature Selection Process in
Machine Learning
Importance★★★★☆
Difficulty★★★☆☆
You are a data analyst working for a company that provides a health and
fitness tracking app. Your manager has tasked you with improving the
prediction accuracy of an algorithm that estimates a person's health score
based on various factors such as heart rate, step count, and sleep data. Your
goal is to visualize which features in the dataset are most important for
predicting the health score, so the team can optimize data collection for
more accurate predictions. Create a Python script to select the most relevant
features for predicting the health score and visualize the results using a
feature importance plot. The dataset should have 10 features and 1 target
variable, with at least 200 records generated randomly.
【Data Generation Code Example】
import numpy as np
import pandas as pd
np.random.seed(42)
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
X = data.drop(columns=['health_score'])
y = data['health_score']
model.fit(X_train, y_train)
feature_names = X.columns
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
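Because only part of the answer code appears above, the following is a minimal self-contained sketch of the feature-importance visualization the task asks for. The choice of RandomForestRegressor, the generic feature names, and the way the synthetic health score is constructed are assumptions made for illustration.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
np.random.seed(42)
# 200 records with 10 random features (stand-ins for heart rate, step count, sleep data, etc.)
data = pd.DataFrame(np.random.rand(200, 10), columns=[f'feature_{i}' for i in range(1, 11)])
# Hypothetical target: health score driven mainly by the first three features plus noise
data['health_score'] = 3 * data['feature_1'] + 2 * data['feature_2'] + data['feature_3'] + 0.1 * np.random.randn(200)
X = data.drop(columns=['health_score'])
y = data['health_score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
feature_names = X.columns
plt.barh(feature_names, model.feature_importances_)
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()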
X, y = make_classification(n_samples=500, n_features=10,
n_informative=8, n_classes=2, random_state=42)
【Diagram Answer】
【Code Answer】
import numpy as np
X, y = make_classification(n_samples=500, n_features=10,
n_informative=8, n_classes=2, random_state=42)
# Split the data into training and test sets
model = RandomForestClassifier(random_state=42)
grid_search.fit(X_train, y_train)
results = grid_search.cv_results_
n_estimators_vals = results['param_n_estimators'].data
max_depth_vals = results['param_max_depth'].data
mean_test_scores = results['mean_test_score']
plt.figure(figsize=(10, 6))
plt.colorbar(label='Mean Accuracy')
plt.xlabel('Number of Estimators')
plt.ylabel('Max Depth')
plt.show()
import pandas as pd
#Generate synthetic sales data with different trends for each region
【Code Answer】
import numpy as np
import pandas as pd
#Preparing data
X = data[['marketing_spend', 'week']]
y = data['sales']
#Model training
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
regions_unique = data['region'].unique()
performance_by_region = {}
y_subset = subset['sales']
y_pred_subset = model.predict(X_subset)
#Visualizing performance
regions_list = list(performance_by_region.keys())
r2_scores = list(performance_by_region.values())
plt.figure(figsize=(10, 6))
plt.bar(regions_list, r2_scores)
plt.xlabel('Region')
plt.ylabel('R² Score')
plt.show()
import pandas as pd
df['Purchase'] = labels
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
【Diagram Answer】
【Code Answer】
import numpy as np
import pandas as pd
df['Purchase'] = labels
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
def build_and_evaluate_model(layers):
plt.plot(layers_range, accuracies)
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()
In this exercise, you are building a neural network using the Sequential API
from the tensorflow.keras library.
The goal is to understand how adding more hidden layers to the model
impacts its accuracy in predicting whether a customer will make a purchase.
First, we generate synthetic data to simulate customers with three features:
age, income, and shopping habits.
We assume customers with higher income are more likely to purchase, so
the purchase label is based on whether the income is over 100,000.
This data is then split into training and test sets.
We scale the features using StandardScaler to improve model performance
by ensuring all inputs are on the same scale.
The neural network is created with a flexible number of hidden layers,
controlled by the build_and_evaluate_model function.
Each layer uses the relu activation function, which introduces non-linearity
and helps the network learn complex patterns.
The output layer has one neuron and uses the sigmoid activation function,
suitable for binary classification tasks.
The model is trained using binary crossentropy as the loss function and the
Adam optimizer.
After training, we evaluate the model's performance using accuracy as a
metric.
Finally, the script generates a plot to show the relationship between the
number of hidden layers and model accuracy.
This allows you to visualize how model complexity (in terms of hidden
layers) influences predictive performance.
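The answer code shown above is abbreviated, so as a point of reference, a minimal end-to-end sketch of the experiment described here could look like the following. The synthetic customer data, the layer width of 16 units, and the number of training epochs are assumptions chosen for illustration.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
np.random.seed(42)
# Synthetic customers: age, income, shopping habits; purchase label depends on income > 100,000
age = np.random.randint(18, 70, 1000)
income = np.random.randint(20000, 200000, 1000)
habits = np.random.rand(1000)
X = np.column_stack([age, income, habits])
y = (income > 100000).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
def build_and_evaluate_model(n_layers):
    # Stack n_layers hidden relu layers, then a sigmoid output for binary classification
    model = Sequential([Input(shape=(X_train_scaled.shape[1],))])
    for _ in range(n_layers):
        model.add(Dense(16, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train_scaled, y_train, epochs=10, verbose=0)
    _, accuracy = model.evaluate(X_test_scaled, y_test, verbose=0)
    return accuracy
layers_range = range(1, 6)
accuracies = [build_and_evaluate_model(n) for n in layers_range]
plt.plot(layers_range, accuracies)
plt.xlabel('Number of Hidden Layers')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()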
【Trivia】
Did you know? The relu activation function is often preferred in neural
networks because it reduces the likelihood of vanishing gradients, a
common issue in deep learning. This helps the model train faster and
achieve better performance in many cases.
79. Visualizing the Gradient Boosting Process for
Customer Churn Prediction
Importance★★★★★
Difficulty★★★☆☆
A telecom company wants to predict customer churn, as retaining
customers is critical for its business success. The dataset includes customer
information such as monthly charges, contract duration, and service usage
details. Your task is to train a Gradient Boosting model to predict whether a
customer is likely to churn based on this dataset. After training the model,
visualize the contribution of each feature (importance) to the decision-making
process by generating a plot of feature importances. Implement the following
steps: generate a synthetic dataset to simulate customer data (features such
as monthly charges, contract duration, etc.), train a Gradient Boosting
Classifier to predict churn, and plot a feature importance graph to visualize
which features contribute most to the prediction.
【Data Generation Code Example】
import numpy as np
import pandas as pd
data = pd.DataFrame({
'MonthlyCharges': np.random.randint(20, 100, 1000),
})
X = data.drop('Churn', axis=1)
y = data['Churn']
【Code Answer】
import numpy as np
import pandas as pd
data = pd.DataFrame({
})
X = data.drop('Churn', axis=1)
y = data['Churn']
gb_model = GradientBoostingClassifier(n_estimators=100,
random_state=42)
gb_model.fit(X_train, y_train)
feature_importances = gb_model.feature_importances_
features = X.columns
plt.figure(figsize=(8, 6))
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance in Gradient Boosting Model')
plt.show()
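Since the data generation and plotting lines above are only partially shown, here is a minimal self-contained sketch of the same workflow. The ContractDuration and ServiceUsage columns and the rule used to derive the Churn label are assumptions added for illustration.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
np.random.seed(42)
# Synthetic customer data; value ranges and the churn rule are illustrative
data = pd.DataFrame({
    'MonthlyCharges': np.random.randint(20, 100, 1000),
    'ContractDuration': np.random.randint(1, 36, 1000),
    'ServiceUsage': np.random.rand(1000) * 100,
})
data['Churn'] = ((data['MonthlyCharges'] > 70) & (data['ContractDuration'] < 12)).astype(int)
X = data.drop('Churn', axis=1)
y = data['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)
feature_importances = gb_model.feature_importances_
features = X.columns
plt.figure(figsize=(8, 6))
plt.barh(features, feature_importances)
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance in Gradient Boosting Model')
plt.show()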
【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
model.compile(optimizer='adam', loss='mse')
preds = []
input_seq = data[-100:]
for _ in range(30):
preds.append(pred[0][0])
plt.figure(figsize=(10, 6))
plt.ylabel("Sales")
plt.legend()
plt.show()
In this exercise, you are asked to use a Recurrent Neural Network (RNN) to
predict future values in a time-series dataset. The task begins by generating
synthetic data that mimics real-world product sales over a period of 500
days. We add a sine wave and random noise to simulate the fluctuations and
unpredictability seen in real sales data.
To prepare the data for the RNN, we need to break it down into sequences.
For each sequence of 100 past days, we train the model to predict the sales
value for the next day. This kind of setup, where previous data is used to
predict the future, is common in time-series forecasting problems.
The RNN model is built using the Sequential API from Keras, with a
SimpleRNN layer and a Dense layer. The RNN layer has 50 units and uses
the tanh activation function, which is commonly used for recurrent layers
due to its ability to handle time dependencies. The final layer is a Dense
layer with a single neuron, which outputs the predicted sales value for the
next day.
We train the model using the Adam optimizer and mean squared error
(MSE) as the loss function. After training, the model is used to predict sales
for the next 30 days, using the last 100 days' data as input. This process is
repeated iteratively: after each prediction, the predicted value is added to the
input sequence and the oldest value is dropped, allowing us to make
consecutive forecasts.
Finally, we visualize the results by plotting the actual sales values alongside
the predicted values for the next 30 days. The plot helps us assess how well
the model forecasts future sales and allows us to visually compare the
model's performance. This visualization is crucial for understanding the
model's strengths and weaknesses when applied to time-series forecasting
tasks.
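The answer code above is excerpted, so a minimal self-contained version of the forecasting loop described here might look like the following. The 50-unit SimpleRNN layer, tanh activation, 100-day window, and 30-day forecast horizon follow the description above; the synthetic sales series and the number of training epochs are assumptions for illustration.
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, Input
np.random.seed(42)
# Synthetic daily sales: sine wave plus noise over 500 days
days = np.arange(500)
data = 100 + 20 * np.sin(days * 0.05) + np.random.randn(500) * 2
# Build sliding windows: 100 past days as input, the next day's sales as target
window = 100
X = np.array([data[i:i + window] for i in range(len(data) - window)]).reshape(-1, window, 1)
y = data[window:]
model = Sequential([Input(shape=(window, 1)), SimpleRNN(50, activation='tanh'), Dense(1)])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=10, verbose=0)
# Iteratively forecast the next 30 days from the last 100 observed values
preds = []
input_seq = list(data[-window:])
for _ in range(30):
    pred = model.predict(np.array(input_seq).reshape(1, window, 1), verbose=0)
    preds.append(pred[0][0])
    input_seq = input_seq[1:] + [pred[0][0]]
plt.figure(figsize=(10, 6))
plt.plot(days, data, label='Actual Sales')
plt.plot(np.arange(500, 530), preds, label='Predicted Sales (next 30 days)')
plt.xlabel('Day')
plt.ylabel('Sales')
plt.legend()
plt.show()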
【Trivia】
RNNs are particularly suited for time-series data because they retain
information from previous time steps, allowing them to capture temporal
dependencies. However, in practice, they are often replaced by more
advanced models like LSTMs (Long Short-Term Memory) or GRUs (Gated
Recurrent Units), which address the problem of vanishing gradients and can
remember dependencies over longer time periods.
81. Visualizing Data Transformation in Machine
Learning
Importance★★★★☆
Difficulty★★★☆☆
A retail company wants to analyze the sales performance of its various
stores based on monthly sales data. The goal is to identify any trends or
patterns that could inform decision-making for optimizing inventory levels
and marketing efforts. You are asked to process the sales data for five stores
over a period of 12 months, visualize the transformed data, and explain how
data transformation improves the model training process. Your task is to:
create a dataset that simulates monthly sales data for five stores; normalize
this data using Min-Max scaling to make it suitable for machine learning
models; and visualize the original data and the normalized data side by side
using line plots to show how data transformation impacts the scale of the
features.
【Data Generation Code Example】
import numpy as np
import pandas as pd
np.random.seed(42)
【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
np.random.seed(42)
#Generate random sales data for 5 stores over 12 months (value ranges here are illustrative)
months = range(1, 13)
sales_data = pd.DataFrame(np.random.randint(1000, 5000, size=(12, 5)),
index=months, columns=['Store_1', 'Store_2', 'Store_3', 'Store_4', 'Store_5'])
scaler = MinMaxScaler()
normalized_data = pd.DataFrame(scaler.fit_transform(sales_data),
index=months, columns=sales_data.columns)
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
plt.plot(sales_data.index, sales_data)
plt.xlabel('Month')
plt.ylabel('Sales')
plt.subplot(1, 2, 2)
plt.plot(normalized_data.index, normalized_data)
plt.xlabel('Month')
plt.ylabel('Normalized Sales')
plt.tight_layout()
plt.show()
In machine learning, data preprocessing is a critical step that directly affects
the performance of models. One common preprocessing technique is data
normalization. In this exercise, we used Min-Max scaling to normalize sales
data for multiple stores. Min-Max scaling transforms the data to a fixed
range, typically between 0 and 1, by subtracting the minimum value of a
feature and dividing by the range (max - min). This ensures that all features
are on the same scale, which is important for algorithms such as gradient
descent-based methods, where large feature values could dominate smaller
ones and lead to biased results.
First, we generated a dataset that simulated the sales figures for five stores
across 12 months. The raw data varied widely in scale, which can be
problematic for machine learning models. To address this, we applied
Min-Max scaling, which rescaled all values to a range between 0 and 1.
Visualizing both the original and normalized data helps us understand how
scaling affects the distribution and scale of the data. In the plot of the
original data, the sales values differ significantly between stores, which
could introduce bias during model training. In contrast, the normalized data
shows all stores having values within the same range, making it easier for
models to learn without any store having undue influence on the outcome.
Data normalization is particularly useful when using algorithms that rely on
distance metrics (such as K-nearest neighbors or support vector machines)
or optimization algorithms like gradient descent. These models perform
better when the input features are scaled consistently. Normalization also
helps prevent numerical instability and can speed up the convergence of
models.
In conclusion, transforming data through techniques like Min-Max scaling
helps ensure that machine learning models can effectively process and learn
from features that might originally be on vastly different scales.
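To make the scaling rule concrete, the short snippet below applies the Min-Max formula by hand and confirms that it matches scikit-learn's MinMaxScaler; the sample sales values are arbitrary.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
x = np.array([[1200.0], [3400.0], [2500.0], [4800.0]])  # arbitrary monthly sales figures
manual = (x - x.min()) / (x.max() - x.min())  # (value - min) / (max - min)
print(np.allclose(manual, MinMaxScaler().fit_transform(x)))  # True: both map the values into [0, 1]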
【Trivia】
Did you know that normalization is not always necessary? Some machine
learning algorithms, like decision trees or random forests, are not sensitive
to feature scaling because they make decisions based on feature thresholds
rather than distances. However, for models relying on distance or
optimization, normalization can be crucial!
82. Visualizing the Impact of Feature Scaling in
Machine Learning
Importance★★★★★
Difficulty★★★☆☆
A financial institution has a customer dataset containing various features
such as income, loan amount, and credit score. Your task is to build a
machine learning model that predicts whether a customer will default on a
loan based on these features. Before building the model, you need to
visualize how scaling affects the features and the model performance.
Create a synthetic dataset with the features income, loan amount, and credit
score, plus a binary target variable (loan default: 0 or 1). Plot the features
using scatter plots before and after scaling, using StandardScaler for the
scaling. Demonstrate the importance of feature scaling by observing how
the different scales impact the model's accuracy.
【Data Generation Code Example】
import numpy as np
import pandas as pd
np.random.seed(42)
n_samples = 100
data = {
df = pd.DataFrame(data)
【Diagram Answer】
【Code Answer】
import numpy as np
import pandas as pd
np.random.seed(42)
data = {
}
df = pd.DataFrame(data)
y = df['default']
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title('Before Scaling')
plt.xlabel('Income')
plt.ylabel('Loan Amount')
X_scaled = scaler.fit_transform(X)
plt.subplot(1, 2, 2)
plt.title('After Scaling')
plt.xlabel('Income (scaled)')
plt.ylabel('Loan Amount (scaled)')
plt.tight_layout()
plt.show()
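Only fragments of the intended answer appear above, so the following is a minimal self-contained sketch of the before/after comparison described in the task. The synthetic feature distributions, the rule generating the default label, and the use of LogisticRegression for the accuracy check are assumptions made for illustration.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
np.random.seed(42)
n_samples = 100
df = pd.DataFrame({
    'income': np.random.normal(50000, 15000, n_samples),
    'loan_amount': np.random.normal(20000, 8000, n_samples),
    'credit_score': np.random.normal(650, 70, n_samples),
})
df['default'] = (df['loan_amount'] / df['income'] > 0.45).astype(int)
X = df[['income', 'loan_amount', 'credit_score']]
y = df['default']
# Scatter plots of two features before and after standardization
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(X['income'], X['loan_amount'], c=y)
plt.title('Before Scaling')
plt.xlabel('Income')
plt.ylabel('Loan Amount')
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
plt.subplot(1, 2, 2)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y)
plt.title('After Scaling')
plt.xlabel('Income (scaled)')
plt.ylabel('Loan Amount (scaled)')
plt.tight_layout()
plt.show()
# Compare accuracy with and without scaling (scaling mainly matters for optimizer- and distance-based models)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
raw_acc = accuracy_score(y_test, LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test))
scaled_acc = accuracy_score(y_test, LogisticRegression(max_iter=1000).fit(scaler.fit_transform(X_train), y_train).predict(scaler.transform(X_test)))
print(f'Accuracy without scaling: {raw_acc:.3f}, with scaling: {scaled_acc:.3f}')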
X, y = make_classification(n_samples=1000, n_features=20,
n_informative=15, random_state=42)
【Diagram Answer】
【Code Answer】
import numpy as np
X, y = make_classification(n_samples=1000, n_features=20,
n_informative=15, random_state=42)
# Initialize models
plt.figure(figsize=(10, 6))
plt.xlabel('Accuracy')
plt.ylabel('Frequency')
plt.legend(loc='best')
plt.show()
【Code Answer】
import numpy as np
## Algorithms initialization
models = {
}
## Store training times
for X, y in datasets:
start_time = time()
model.fit(X, y)
training_times[model_name].append(time() - start_time)
plt.legend()
plt.show()
X, y = make_classification(n_samples=1000, n_features=10,
n_informative=5, n_classes=2)
【Code Answer】
import numpy as np
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=10,
n_informative=5, n_classes=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = DecisionTreeClassifier()
grid_search.fit(X_train, y_train)
scores = results['mean_test_score'].reshape(len(param_grid['max_depth']),
len(param_grid['min_samples_split']))
plt.figure(figsize=(10, 6))
sns.heatmap(scores, annot=True,
xticklabels=param_grid['min_samples_split'],
yticklabels=param_grid['max_depth'])
plt.xlabel('min_samples_split')
plt.ylabel('max_depth')
plt.show()
df = pd.DataFrame({
'churn': [1, 0, 0, 1, 0, 1, 0, 1, 0]
})
y = df['churn']
【Code Answer】
import pandas as pd
df = pd.DataFrame({
'tenure': [1, 3, 4, 10, 12, 24, 36, 48, 60],
'monthly_charges': [20, 30, 40, 50, 60, 70, 80, 90, 100],
'churn': [1, 0, 0, 1, 0, 1, 0, 1, 0]
})
y = df['churn']
clf.fit(X_train, y_train)
plt.show()
import pandas as pd
【Code Answer】
import numpy as np
import pandas as pd
X_test = np.array(X_test).reshape(-1, 1)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
plt.figure(figsize=(10,6))
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.show()
This exercise walks you through building a time series forecasting model
and visualizing the errors in its predictions.
First, we generate synthetic sales data using a sine wave and random noise
to simulate realistic sales fluctuations. The pd.date_range function is used
to generate daily timestamps, and sales are modeled as a mix of sinusoidal
trends and random noise.
Once we have the data, we convert the Date column into numerical values
(days since the first date) so that the linear regression model can use it as a
feature. Time series data typically needs to be numerically represented for
regression models.
We then split the data into training and testing sets. Since time series data
relies on temporal order, we do not shuffle the data, as this would break the
temporal correlation. The training set consists of the earlier part of the data,
and the test set consists of the more recent part.
The model is a simple LinearRegression from scikit-learn. It is trained on
the date (in numerical format) and sales values. After training, we predict
sales for the test period.
The key focus is the visualization of errors. After plotting both the actual
and predicted sales, we compute the difference (error) between actual and
predicted values. This error is plotted as a dashed line to highlight where
and by how much the prediction deviates from reality.
This exercise emphasizes not only forecasting but also visual error analysis,
which is critical when evaluating the performance of time series models.
Understanding these errors helps improve models and provides insights into
trends that the model may have missed.
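The answer code above is only partially shown; a minimal self-contained version of the forecasting and error-visualization workflow described here could look like this. The synthetic sales series and the 80/20 chronological split are illustrative assumptions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
np.random.seed(42)
# Daily sales: sinusoidal trend plus random noise
dates = pd.date_range('2023-01-01', periods=365, freq='D')
sales = 100 + 10 * np.sin(np.arange(365) * 2 * np.pi / 30) + np.random.randn(365) * 3
df = pd.DataFrame({'Date': dates, 'Sales': sales})
# Represent each date as the number of days since the first observation
df['DayNumber'] = (df['Date'] - df['Date'].min()).dt.days
# Chronological split: earlier data for training, later data for testing (no shuffling)
split = int(len(df) * 0.8)
X_train = df['DayNumber'].values[:split].reshape(-1, 1)
X_test = df['DayNumber'].values[split:].reshape(-1, 1)
y_train, y_test = df['Sales'].values[:split], df['Sales'].values[split:]
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
plt.figure(figsize=(10, 6))
plt.plot(df['Date'][split:], y_test, label='Actual Sales')
plt.plot(df['Date'][split:], y_pred, label='Predicted Sales')
plt.plot(df['Date'][split:], y_test - y_pred, '--', label='Error (Actual - Predicted)')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.show()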
【Trivia】
Did you know that linear regression, one of the simplest machine learning
algorithms, has been around since the 1800s? It was originally developed
by Francis Galton to study the inheritance of traits and was later expanded
by Karl Pearson. Despite its simplicity, it is still widely used today for
various types of predictive modeling, including time series forecasting!
88. Visualizing the Results of a Machine Learning
Classification Task
Importance★★★★☆
Difficulty★★★☆☆
You are working for a company that wants to predict customer churn based
on user data. The goal is to train a machine learning model to classify
whether a customer will churn or not, based on certain features. After
training the model, the company wants to visualize how well the model
performs using a confusion matrix and classification report. Your task is to:
create sample data for customer churn prediction (e.g., 'age',
'monthly_spending', 'years_with_company'), train a machine learning
model on this data, and visualize the results using a confusion matrix and
show the model's performance through a classification report.
【Data Generation Code Example】
import numpy as np
import pandas as pd
【Code Answer】
import numpy as np
import pandas as pd
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.title('Confusion Matrix')
plt.show()
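Because only part of the answer code is shown above, here is a minimal self-contained sketch of the full task: synthetic churn data, a RandomForestClassifier, a confusion matrix plot, and a printed classification report. The feature distributions and the rule generating the churn label are assumptions for illustration.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.randint(18, 70, 500),
    'monthly_spending': np.random.randint(20, 200, 500),
    'years_with_company': np.random.randint(0, 10, 500),
})
# Hypothetical churn rule: newer, lower-spending customers churn more often
df['churn'] = ((df['years_with_company'] < 3) & (df['monthly_spending'] < 100)).astype(int)
X = df[['age', 'monthly_spending', 'years_with_company']]
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.title('Confusion Matrix')
plt.show()
print(classification_report(y_test, y_pred))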