ML - Lab Manual
Step 1
Load the Dataset:
import pandas as pd
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
# Create a DataFrame
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['target'] = iris.target
Step 2
Explore the Dataset:
# Summary statistics
print(data.describe())
Step 3
Apply Normalization:
from sklearn.preprocessing import MinMaxScaler
# Initialize scaler
scaler = MinMaxScaler()
# Scale the numeric features to the range [0, 1]
normalized_data = scaler.fit_transform(data[iris.feature_names])
# Convert to DataFrame
normalized_df = pd.DataFrame(normalized_data, columns=iris.feature_names)
print(normalized_df.head())
Step 4
Visualize the Normalized Data:
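No plotting code survives at this step; one possible sketch, using seaborn boxplots of the normalized features (sns and plt are assumed imports, and normalized_df comes from the code above):
import matplotlib.pyplot as plt
import seaborn as sns
# Boxplots of the normalized features; every value should now lie within [0, 1]
plt.figure(figsize=(8, 5))
sns.boxplot(data=normalized_df)
plt.xticks(rotation=45)
plt.title('Min-Max Normalized Iris Features')
plt.tight_layout()
plt.show()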
Results
The dataset has been normalized, ensuring all numeric features lie within the range [0, 1]. This
helps improve the performance and convergence rate of machine learning models.
Conclusion
Normalization ensures that all features contribute equally to the model training process by
scaling them to a similar range. This is particularly important for distance-based algorithms.
Viva Questions
1. What is the difference between normalization and standardization?
2. Why is normalization important for distance-based algorithms like k-NN or SVM?
3. How does Min-Max normalization work?
3. Apply Dimensionality Reduction Techniques (e.g., PCA) to a Dataset and Analyze the Results
Objective
To reduce the dimensionality of a dataset using Principal Component Analysis (PCA) and
analyze the transformed dataset.
Prerequisites
- Python programming.
- Libraries: pandas, numpy, sklearn, matplotlib, seaborn.
Dataset Description
Dataset Used: Breast Cancer dataset.
Source: Available in the sklearn library.
Features:
- Various diagnostic measurements of breast tumors.
- Target: Benign (0) or Malignant (1).
Procedure
Step 1
Load the Dataset:
from sklearn.datasets import load_breast_cancer
import pandas as pd
# Load dataset
cancer = load_breast_cancer()
# Create a DataFrame
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
data['target'] = cancer.target
Step 2
Explore the Dataset:
# Basic information
print(data.info())
# Summary statistics
print(data.describe())
Step 3
Apply PCA:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Standardize the features before applying PCA
scaled_data = StandardScaler().fit_transform(data[cancer.feature_names])
# Apply PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_data)
# Convert the principal components to a DataFrame
pca_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])
print(pca_df.head())
Step 4
Analyze the Results:
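No analysis code survives here; a minimal sketch of one way to examine how much variance the two components retain, using the pca object fitted above:
# Proportion of variance explained by each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())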
Step 5
Visualize the PCA Results:
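One possible plotting sketch, assuming matplotlib and the pca_df DataFrame created above, with points coloured by the benign/malignant target:
import matplotlib.pyplot as plt
# Scatter plot of the two principal components, coloured by class
plt.figure(figsize=(8, 6))
plt.scatter(pca_df['PC1'], pca_df['PC2'], c=cancer.target, cmap='coolwarm', alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of the Breast Cancer Dataset')
plt.show()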
Results
The dataset dimensionality was successfully reduced to two principal components, explaining a
significant proportion of the variance. The transformed dataset was visualized to highlight
patterns and separability of the classes.
Conclusion
PCA helps in reducing the dimensionality of a dataset by retaining the most important features
that explain the majority of the variance. It aids in visualizing high-dimensional data and
reducing computational complexity.
Viva Questions
1. What is the purpose of dimensionality reduction?
2. How does PCA differ from feature selection?
3. What is the role of standardization in PCA?
4. Implement K-Means Clustering Algorithm on a Dataset
Objective
To apply the K-Means clustering algorithm on a dataset to identify clusters within the data and
analyze the results.
Prerequisites
Python programming.
Libraries: pandas, numpy, sklearn, matplotlib, seaborn.
Dataset Description
Procedure
Step 1: Load the Dataset
from sklearn.datasets import load_iris
import pandas as pd
# Load dataset
iris = load_iris()
# Create a DataFrame
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['target'] = iris.target
Step 2: Explore the Dataset
# Summary statistics
print(data.describe())
Step 3: Apply K-Means Clustering
from sklearn.cluster import KMeans
# Define the model with the number of clusters (k=3 for the Iris dataset)
kmeans = KMeans(n_clusters=3, random_state=42)
# Fit the model and assign a cluster label to each sample
data['cluster'] = kmeans.fit_predict(data[iris.feature_names])
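The Results below refer to the cluster centers, the inertia, and a Sepal length vs. Sepal width plot; a minimal sketch of that inspection step, assuming the variables defined above:
import matplotlib.pyplot as plt
# Inspect the fitted model
print("Cluster centers:\n", kmeans.cluster_centers_)
print("Inertia:", kmeans.inertia_)
# Scatter plot of the clusters on the first two features
plt.scatter(data['sepal length (cm)'], data['sepal width (cm)'], c=data['cluster'])
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('K-Means Clusters (k=3)')
plt.show()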
Results
The dataset was successfully clustered into three groups using K-Means. The cluster centers and
inertia values were calculated. The visualization helped to observe the separation of the clusters
based on Sepal length and Sepal width.
Conclusion
K-Means clustering is effective in identifying groups within the dataset, especially when the
number of clusters (k) is known or can be determined through techniques like the elbow method.
The clustering results can be analyzed through the center points and inertia to assess the quality
of the model.
Viva Questions
5. Implement Hierarchical Clustering Algorithm on a Dataset
Objective
To apply the Hierarchical Clustering algorithm to a dataset and visualize the results to identify
clusters.
Prerequisites
Python programming.
Libraries: pandas, numpy, sklearn, matplotlib, seaborn, scipy.
Dataset Description
Procedure
Step 1: Load the Dataset
from sklearn.datasets import load_iris
import pandas as pd
# Load dataset
iris = load_iris()
# Create a DataFrame
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['target'] = iris.target
Step 2: Explore the Dataset
# Summary statistics
print(data.describe())
Step 3: Apply Hierarchical Clustering
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
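The clustering code itself is missing in this copy; a minimal sketch consistent with the Results below, assuming standardized features, Ward linkage for the dendrogram, and three agglomerative clusters:
import matplotlib.pyplot as plt
# Standardize the features
scaled_data = StandardScaler().fit_transform(data[iris.feature_names])
# Build the linkage matrix and plot the dendrogram
linked = linkage(scaled_data, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(linked)
plt.title('Dendrogram (Ward linkage)')
plt.show()
# Assign final cluster labels with agglomerative clustering (3 clusters)
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
data['cluster'] = agg.fit_predict(scaled_data)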
Results
The hierarchical clustering algorithm successfully grouped the data into three clusters. A
dendrogram was used to visualize the clustering process. The final clusters were assigned based
on the distance threshold in the linkage matrix.
Conclusion
Hierarchical clustering provides a powerful method to visualize and group data without needing
a predefined number of clusters. The Dendrogram helps in determining the appropriate number
of clusters by observing the distance at which data points merge.
Viva Questions
1. What is hierarchical clustering, and how does it differ from K-means clustering?
2. What is the significance of the linkage method in hierarchical clustering?
3. How do you interpret a dendrogram, and how can it help in determining the number of
clusters?
4. What are the advantages and limitations of hierarchical clustering?
6. Implement Density-Based Clustering Algorithm (DBSCAN) on a Dataset
Objective
To apply the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm
to a dataset to identify clusters and outliers based on density.
Prerequisites
Python programming.
Libraries: pandas, numpy, sklearn, matplotlib, seaborn.
Dataset Description
Procedure
Step 1: Load the Dataset
from sklearn.datasets import load_iris
import pandas as pd
# Load dataset
iris = load_iris()
# Create a DataFrame
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['target'] = iris.target
Step 2: Explore the Dataset
# Summary statistics
print(data.describe())
Step 3: Apply DBSCAN Clustering
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
# Standardize the features
scaled_data = StandardScaler().fit_transform(data[iris.feature_names])
# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
data['cluster'] = dbscan.fit_predict(scaled_data)
eps (epsilon): the maximum distance between two samples for them to be considered neighbors, i.e. part of the same dense region.
min_samples: the minimum number of samples within eps of a point for that point to be treated as a core point.
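The Results below mention a scatter plot of Sepal length vs. Sepal width; a minimal sketch of that visualization, assuming the data and cluster columns defined above:
import matplotlib.pyplot as plt
# Scatter plot of the DBSCAN clusters; noise points carry the label -1
plt.scatter(data['sepal length (cm)'], data['sepal width (cm)'], c=data['cluster'])
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('DBSCAN Clusters (noise = -1)')
plt.show()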
Results
The DBSCAN algorithm successfully grouped the data into clusters based on density. Outliers or
noise points are labeled as -1. The clustering results were visualized using a scatter plot of Sepal
length vs. Sepal width.
Conclusion
DBSCAN groups points that lie in dense regions and labels isolated points as noise, so it can discover clusters of arbitrary shape without the number of clusters being specified in advance. The results are sensitive to the choice of eps and min_samples, which should be tuned for the dataset at hand.
Viva Questions
7. Implement Apriori Algorithm for Association Rule Mining on a Dataset
Objective
To apply the Apriori algorithm for discovering frequent itemsets and generating association rules
from a dataset.
Prerequisites
Python programming.
Libraries: mlxtend, pandas.
Dataset Description
Procedure
Step 1: Install Required Libraries
To use the Apriori algorithm, we need the mlxtend library. If you don’t have it installed, you can install it via pip:
pip install mlxtend
# Sample transactional data (each row represents a transaction with items bought)
data = [
['Milk', 'Bread', 'Butter'],
['Beer', 'Diaper', 'Milk', 'Bread'],
['Milk', 'Diaper', 'Beer', 'Cola'],
['Bread', 'Milk', 'Butter'],
['Diaper', 'Milk', 'Bread', 'Butter']
]
We need to convert the transaction list into a one-hot encoded format that the Apriori implementation can process; mlxtend's TransactionEncoder does this directly on the list of transactions.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
# One-hot encode the transactions
te = TransactionEncoder()
te_ary = te.fit(data).transform(data)
# Convert to a DataFrame
encoded_df = pd.DataFrame(te_ary, columns=te.columns_)
print(encoded_df.head())
Now, we apply the Apriori algorithm to find frequent itemsets with a specified minimum
support.
from mlxtend.frequent_patterns import apriori
# Apply the Apriori algorithm to find frequent itemsets with a minimum support of 0.6
frequent_itemsets = apriori(encoded_df, min_support=0.6, use_colnames=True)
print(frequent_itemsets)
Once the frequent itemsets are found, we generate association rules based on the frequent
itemsets.
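A minimal sketch of the rule-generation step, using mlxtend's association_rules with an assumed lift threshold of 1.0:
from mlxtend.frequent_patterns import association_rules
# Generate association rules from the frequent itemsets
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])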
Results
The Apriori algorithm identifies frequent itemsets based on the minimum support threshold, and
the association rules show relationships between items that often appear together in transactions.
The lift metric indicates the strength of the rule.
Conclusion
The Apriori algorithm is widely used in market basket analysis and other fields where
association rule mining is necessary. It helps in finding patterns in large datasets, such as items
that are frequently bought together. The minimum support and lift thresholds are key parameters
that control the output.
Viva Questions
8. Implement FP-Growth Algorithm for Frequent Itemset Mining on a Dataset
Objective
To apply the FP-Growth (Frequent Pattern Growth) algorithm to find frequent itemsets in a
dataset. The FP-Growth algorithm is an efficient alternative to the Apriori algorithm, particularly
useful for large datasets.
Prerequisites
Python programming.
Libraries: mlxtend, pandas.
Dataset Description
Procedure
Step 1: Install Required Libraries
To use the FP-Growth algorithm, we need the mlxtend library. If you don’t have it installed, you can install it via pip:
pip install mlxtend
# Sample transactional data (each row represents a transaction with items bought)
data = [
['Milk', 'Bread', 'Butter'],
['Beer', 'Diaper', 'Milk', 'Bread'],
['Milk', 'Diaper', 'Beer', 'Cola'],
['Bread', 'Milk', 'Butter'],
['Diaper', 'Milk', 'Bread', 'Butter']
]
We need to convert the transaction list into a one-hot encoded format that the FP-Growth implementation can process; mlxtend's TransactionEncoder does this directly on the list of transactions.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
# One-hot encode the transactions
te = TransactionEncoder()
te_ary = te.fit(data).transform(data)
# Convert to a DataFrame
encoded_df = pd.DataFrame(te_ary, columns=te.columns_)
print(encoded_df.head())
Now, we apply the FP-Growth algorithm to find frequent itemsets with a specified minimum
support.
from mlxtend.frequent_patterns import fpgrowth
# Apply the FP-Growth algorithm to find frequent itemsets with a minimum support of 0.6
frequent_itemsets = fpgrowth(encoded_df, min_support=0.6, use_colnames=True)
print(frequent_itemsets)
Once the frequent itemsets are found, you can generate association rules based on these itemsets.
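As with the Apriori experiment, a minimal sketch of this step using mlxtend's association_rules (the lift threshold of 1.0 is an assumption):
from mlxtend.frequent_patterns import association_rules
# Generate association rules from the frequent itemsets
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])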
Results
The FP-Growth algorithm identifies frequent itemsets based on the minimum support threshold.
It is more efficient than the Apriori algorithm and does not generate candidate itemsets, making
it faster for larger datasets. The association rules, if generated, show relationships between items
that often appear together in transactions.
Conclusion
The FP-Growth algorithm is highly efficient for mining frequent itemsets in large datasets. It
reduces the computational complexity compared to the Apriori algorithm by eliminating the need
to generate candidate itemsets. This algorithm is ideal for large-scale market-basket analysis and
other applications requiring frequent pattern mining.
Viva Questions
1. What is the FP-Growth algorithm, and how does it differ from the Apriori algorithm?
2. What is the concept of support in frequent itemset mining?
3. How does the FP-Growth algorithm handle the problem of candidate generation in
Apriori?
4. How do the parameters support and lift affect the results in FP-Growth?
9. Implement Decision Tree Classification Algorithm on a Labeled Dataset
Objective
To implement the Decision Tree classification algorithm to classify data based on a labeled
dataset.
Prerequisites
Python programming.
Libraries: pandas, numpy, sklearn, matplotlib, seaborn.
Dataset Description
Procedure
Step 1: Install Required Libraries
Make sure you have the required libraries installed. You can install them using pip:
pip install pandas numpy scikit-learn matplotlib seaborn
Step 2: Load the Dataset
import pandas as pd
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
# Convert to a DataFrame
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['target'] = iris.target
# View the first few rows
print(data.head())
Step 3: Split the Dataset into Training and Testing Sets
We will split the dataset into training and testing sets to evaluate the performance of the model.
from sklearn.model_selection import train_test_split
X = data[iris.feature_names]
y = data['target']
# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Now, we will initialize and train the Decision Tree classifier using the training data.
After training the model, we can use it to predict the target values on the test data.
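No classifier code survives at this point; a minimal sketch of the training and prediction steps, using scikit-learn's DecisionTreeClassifier (the variable names dt and y_pred are assumptions that the evaluation code below relies on):
from sklearn.tree import DecisionTreeClassifier
# Initialize and train the Decision Tree classifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
# Predict the target values on the test data
y_pred = dt.predict(X_test)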
We will evaluate the model using accuracy, confusion matrix, and classification report.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
# Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)
It’s useful to visualize the decision tree to understand how it makes decisions.
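A minimal plotting sketch, assuming the dt classifier from the sketch above and scikit-learn's plot_tree helper:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
# Plot the trained tree with feature and class names
plt.figure(figsize=(12, 8))
plot_tree(dt, feature_names=iris.feature_names, class_names=list(iris.target_names), filled=True)
plt.show()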
Results
Conclusion
The Decision Tree classifier is an effective model for classification tasks, particularly for small
datasets. It’s interpretable, as shown in the visualized decision tree, which can help in
understanding the decisions the model makes.
Viva Questions
10. Implement k-Nearest Neighbors (k-NN) Classification Algorithm on a Labeled Dataset
Objective
To implement the k-Nearest Neighbors (k-NN) classification algorithm on a labeled dataset and
evaluate its performance.
Prerequisites
Python programming.
Libraries: pandas, numpy, sklearn, matplotlib, seaborn.
Dataset Description
Procedure
Step 1: Install Required Libraries
import pandas as pd
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
# Convert to a DataFrame
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['target'] = iris.target
# View the first few rows
print(data.head())
We will split the data into training and testing sets (80% for training, 20% for testing) for model
evaluation.
from sklearn.model_selection import train_test_split
X = data[iris.feature_names]
y = data['target']
# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Now, we'll initialize and train the k-NN classifier using the training data.
After training the model, use it to predict the class labels on the test set.
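No classifier code survives here; one possible sketch, assuming k = 5 and the variable names knn and y_pred (which the evaluation code below relies on):
from sklearn.neighbors import KNeighborsClassifier
# Initialize and train the k-NN classifier with 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# Predict the class labels on the test set
y_pred = knn.predict(X_test)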
We will evaluate the model's performance using accuracy, confusion matrix, and classification
report.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
# Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)
It can be helpful to visualize how the classifier is performing. Here's how to visualize the
decision boundaries for a 2D feature subset (e.g., petal length vs. petal width).
import numpy as np
import matplotlib.pyplot as plt
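The plotting code itself is missing; a minimal sketch of one way to draw the decision regions, assuming a fresh k-NN model (k = 5) trained on only the two petal features:
from sklearn.neighbors import KNeighborsClassifier
# Use only petal length and petal width for a 2D visualization
X_2d = data[['petal length (cm)', 'petal width (cm)']].values
y_2d = data['target'].values
knn_2d = KNeighborsClassifier(n_neighbors=5).fit(X_2d, y_2d)
# Build a grid over the feature space and predict the class of every grid point
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# Plot the decision regions and the training points
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_2d, edgecolor='k')
plt.xlabel('Petal length (cm)')
plt.ylabel('Petal width (cm)')
plt.title('k-NN Decision Boundaries (k=5)')
plt.show()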
Conclusion
The k-NN classifier is a simple, yet powerful algorithm for classification. By selecting an
appropriate value for k, it can perform well on various datasets. Visualizing the decision
boundaries helps understand how the classifier separates the classes based on feature values.
Viva Questions
11. Implement Support Vector Machine (SVM) Classification Algorithm on a Labeled Dataset
Objective
To implement the Support Vector Machine (SVM) classification algorithm on a labeled dataset
and evaluate its performance.
Prerequisites
Python programming.
Libraries: pandas, numpy, sklearn, matplotlib, seaborn.
Dataset Description
Procedure
Step 1: Install Required Libraries
pip install pandas numpy scikit-learn matplotlib seaborn
Step 2: Load the Dataset
import pandas as pd
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
# Convert to a DataFrame
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['target'] = iris.target
# View the first few rows
print(data.head())
Step 3: Split the Dataset into Training and Testing Sets
We will split the data into training and testing sets (80% for training, 20% for testing) for model
evaluation.
from sklearn.model_selection import train_test_split
X = data[iris.feature_names]
y = data['target']
# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Now, we'll initialize and train the SVM classifier using the training data.
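The training code is missing in this copy; a minimal sketch using scikit-learn's SVC with a linear kernel (the Conclusion below refers to the linear kernel, and the variable name svm matches the prediction code that follows):
from sklearn.svm import SVC
# Initialize and train the SVM classifier with a linear kernel
svm = SVC(kernel='linear', random_state=42)
svm.fit(X_train, y_train)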
After training the model, use it to predict the class labels on the test set.
# Predict on the test set
y_pred = svm.predict(X_test)
We will evaluate the model's performance using accuracy, confusion matrix, and classification
report.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
# Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)
Step 7: Visualize the SVM Classifier (Optional)
To better understand how the classifier performs, you can visualize the decision boundaries for a
2D feature subset (e.g., petal length vs. petal width).
import numpy as np
import matplotlib.pyplot as plt
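The plotting code is missing; a minimal sketch analogous to the k-NN experiment, training a linear SVM on only the two petal features and drawing its decision regions:
from sklearn.svm import SVC
# Train a linear SVM on petal length and petal width only
X_2d = data[['petal length (cm)', 'petal width (cm)']].values
y_2d = data['target'].values
svm_2d = SVC(kernel='linear', random_state=42).fit(X_2d, y_2d)
# Build a grid over the feature space and predict the class of every grid point
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = svm_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# Plot the decision regions and the training points
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_2d, edgecolor='k')
plt.xlabel('Petal length (cm)')
plt.ylabel('Petal width (cm)')
plt.title('SVM Decision Boundaries (linear kernel)')
plt.show()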
Results
Conclusion
The SVM classifier is effective for classification tasks, particularly when there is a clear margin
of separation between classes. The linear kernel works well in simple cases, but other kernels
like 'rbf' or 'poly' can be tried for more complex data.
Viva Questions
12. Implement Naïve Bayes Classification Algorithm on a Labeled Dataset
Objective
To implement the Naïve Bayes classification algorithm on a labeled dataset and evaluate its
performance.
Prerequisites
Python programming.
Libraries: pandas, numpy, sklearn, matplotlib, seaborn.
Dataset Description
Procedure
Step 1: Install Required Libraries
pip install pandas numpy scikit-learn matplotlib seaborn
Step 2: Load the Dataset
import pandas as pd
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
# Convert to a DataFrame
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['target'] = iris.target
# View the first few rows
print(data.head())
Step 3: Split the Dataset into Training and Testing Sets
We will split the data into training and testing sets (80% for training, 20% for testing) for model
evaluation.
from sklearn.model_selection import train_test_split
X = data[iris.feature_names]
y = data['target']
# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Now, we'll initialize and train the Naïve Bayes classifier using the training data.
After training the model, use it to predict the class labels on the test set.
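No classifier code survives here; a minimal sketch using scikit-learn's GaussianNB (appropriate for the continuous Iris features; the variable names nb and y_pred are assumptions used by the evaluation code below):
from sklearn.naive_bayes import GaussianNB
# Initialize and train the Gaussian Naïve Bayes classifier
nb = GaussianNB()
nb.fit(X_train, y_train)
# Predict the class labels on the test set
y_pred = nb.predict(X_test)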
We will evaluate the model's performance using accuracy, confusion matrix, and classification
report.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
# Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)
Conclusion
The Naïve Bayes classifier is effective for classification tasks, especially when the features are
conditionally independent. It's a simple yet powerful model for tasks like text classification,
medical diagnosis, and more.
Viva Questions
13. Implement Linear Regression Model on a Dataset
Objective
To implement a Linear Regression model to predict a continuous target variable and evaluate its
performance using metrics such as Mean Squared Error (MSE) and R-squared.
Prerequisites
Python programming.
Libraries: pandas, numpy, sklearn, matplotlib, seaborn.
Dataset Description
Dataset Used: Boston Housing dataset (commonly used for regression tasks).
Source: The dataset is available in the sklearn.datasets module.
Features:
- Various features of housing data, such as crime rate, average number of rooms, property tax rate, etc.
- Target: Median value of owner-occupied homes (in thousands of dollars).
Procedure
Step 1: Install Required Libraries
pip install pandas numpy scikit-learn matplotlib seaborn
Step 2: Load the Dataset
import pandas as pd
from sklearn.datasets import load_boston
# Load dataset (note: load_boston was removed in scikit-learn 1.2; on newer versions,
# fetch_california_housing can be used as an alternative regression dataset)
boston = load_boston()
# Convert to a DataFrame
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data['target'] = boston.target
# View the first few rows
print(data.head())
Step 3: Split the Dataset into Training and Testing Sets
We will split the data into training and testing sets (80% for training, 20% for testing) for model
evaluation.
from sklearn.model_selection import train_test_split
X = data.drop('target', axis=1)
y = data['target']
# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Now, we'll initialize and train the linear regression model using the training data.
After training the model, we use it to predict the target variable on the test set.
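The model code is missing here; a minimal sketch using scikit-learn's LinearRegression (the variable names model and y_pred are assumptions used by the evaluation code below):
from sklearn.linear_model import LinearRegression
# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict the target variable on the test set
y_pred = model.predict(X_test)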
We will evaluate the model using performance metrics like Mean Squared Error (MSE), R-
squared, and the residual plot.
from sklearn.metrics import mean_squared_error, r2_score
# Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
# R-squared value
r2 = r2_score(y_test, y_pred)
print("R-squared:", r2)
Visualizing the predictions vs actual values can provide more insight into the model's
performance.
import matplotlib.pyplot as plt
# Plotting Actual vs Predicted values
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.show()
Results
- Mean Squared Error (MSE): The average of the squared differences between the predicted and actual values. Lower values indicate better performance.
- R-squared: A measure of how well the model explains the variance in the data. Values closer to 1 indicate a better fit.
- Residual Plot: A visualization of the residuals (errors). It helps detect patterns that might indicate issues with the model.
Conclusion
The Linear Regression model has been successfully trained and evaluated on the Boston Housing
dataset. The evaluation metrics, such as R-squared and MSE, help assess the model’s accuracy.
Visualizations like residual and actual-vs-predicted plots can offer deeper insights into the
model’s performance.
Viva Questions
14. Implement Polynomial Regression Model on a Dataset
Objective
To implement a Polynomial Regression model to predict a continuous target variable and evaluate its performance.
Prerequisites
Python programming.
Libraries: pandas, numpy, sklearn, matplotlib.
Dataset Description
We will use the Boston Housing dataset, which is often used for regression tasks.
Features: Various features of housing data, such as crime rate, average number of rooms,
property tax rate, etc.
Target: Median value of owner-occupied homes (in thousands of dollars).
Procedure
Step 1: Install Required Libraries
import pandas as pd
from sklearn.datasets import load_boston
# Load dataset (note: load_boston was removed in scikit-learn 1.2; on newer versions,
# fetch_california_housing can be used as an alternative regression dataset)
boston = load_boston()
# Convert to a DataFrame
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data['target'] = boston.target
# View the first few rows
print(data.head())
We will split the data into training and testing sets (80% for training, 20% for testing) for model
evaluation.
from sklearn.model_selection import train_test_split
X = data.drop('target', axis=1)
y = data['target']
# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Now, we will fit the Polynomial Regression model to the transformed features.
Now that the model is trained, we use it to predict the target variable on the test set.
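The transformation and fitting code is missing in this copy; a minimal sketch, assuming degree-2 polynomial features and the variable names poly, model, and y_pred used by the evaluation code below:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Generate polynomial features (degree 2) from the original features
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
# Fit a linear regression model on the polynomial features
model = LinearRegression()
model.fit(X_train_poly, y_train)
# Predict the target variable on the test set
y_pred = model.predict(X_test_poly)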
We will evaluate the model using performance metrics like Mean Squared Error (MSE), R-
squared, and visualize the results.
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
# Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
# R-squared value
r2 = r2_score(y_test, y_pred)
print("R-squared:", r2)
We can also visualize how well the model fits the actual vs predicted values.
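A minimal plotting sketch, mirroring the actual-vs-predicted plot from the Linear Regression experiment:
# Scatter plot of actual vs predicted values; points near the dashed line indicate a good fit
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Polynomial Regression: Actual vs Predicted Values')
plt.show()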
Results
- Mean Squared Error (MSE): This is a measure of how close the predictions are to the actual values. A lower value indicates better model performance.
- R-squared: This value indicates how well the polynomial regression model explains the variance in the data. Values closer to 1 are better.
- Residual Plot: The plot of residuals helps assess how well the model's predictions match the actual values.
Conclusion
The Polynomial Regression model has been successfully trained and evaluated on the Boston
Housing dataset. The evaluation metrics such as MSE and R-squared help assess the model's
performance, while the residual plot gives insight into the accuracy of the predictions.
Viva Questions
1. What is polynomial regression, and how does it differ from linear regression?
2. How do polynomial features help improve the performance of a regression model?
3. What is the impact of the degree of polynomial features on the model's performance?
4. Why is it important to evaluate the model using metrics like MSE and R-squared?