Machine Learning Lab
_____________________________________________________
Examiners:
1.
2.
Table of Contents

1. Introduction to Machine Learning
2. Univariate analysis: histogram, box plot, bar graph and pie chart using the Iris dataset
3. Bivariate analysis: Pearson's correlation coefficient and heat map using the Titanic dataset
4. Implementation of Principal Component Analysis (PCA)
5. Implementation of the k-nearest neighbours algorithm (kNN) with k = 1, 3, 5
6. Implementation of the non-parametric Locally Weighted Linear Regression algorithm
7. Implementation of Linear Regression using the California Housing dataset and Polynomial Regression using the Auto MPG dataset
8. Implementation of a Decision Tree Classifier using the Titanic dataset
9. Implementation of a Naive Bayes Classifier on the Iris dataset
10. Implementation of K-means clustering using the Wisconsin Breast Cancer dataset
1. Data vs Information
A typical machine learning workflow proceeds through the following steps; a minimal end-to-end sketch follows the list.
1. Data Collection: Gather relevant data from sources like databases, APIs, or CSV files.
2. Data Preprocessing: Handle missing values, remove duplicates, and normalize data.
3. Exploratory Data Analysis (EDA): Use visualization and statistics to understand data patterns.
4. Feature Engineering: Select, extract, or create new features to improve model accuracy.
5. Model Selection: Choose an appropriate machine learning algorithm.
6. Model Training: Train the model using historical data.
7. Model Evaluation: Assess performance using metrics like accuracy, precision, recall, and F1-score.
8. Model Deployment: Deploy the trained model for real-world predictions.
9. Model Monitoring & Improvement: Continuously refine the model based on new data.
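A minimal end-to-end sketch of steps 1-7 on the Iris dataset (the model and metric choices here are illustrative, not prescriptive):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Collect and preprocess: load the Iris data
X, y = load_iris(return_X_y=True)

# Train on 80% of the data, standardizing features on the training split only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=200).fit(scaler.transform(X_train), y_train)

# Evaluate on the held-out 20%
y_pred = model.predict(scaler.transform(X_test))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Weighted F1:", f1_score(y_test, y_pred, average='weighted'))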
Experiment 1
Develop a program to perform univariate analysis and generate a histogram, box plot, bar graph, and pie chart using the Iris dataset.
________________________________________________________
Dataset:
The Iris dataset is readily available through the seaborn library, which provides
a convenient interface for loading and working with the data.
Code:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
# Load the Iris dataset and select a numerical column to analyse
iris = sns.load_dataset('iris')
numerical_col = 'sepal_length'
data = iris[numerical_col]

# Compute essential descriptive statistics for the selected column
print(data.describe())

# Generate a histogram
plt.figure(figsize=(10, 5))
sns.histplot(data, kde=True, bins=10, color='skyblue')
plt.title(f'Histogram of {numerical_col}')
plt.xlabel(numerical_col)
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
# Generate a boxplot
plt.figure(figsize=(8, 4))
sns.boxplot(x=data, color='lightgreen')
plt.title(f'Boxplot of {numerical_col}')
plt.xlabel(numerical_col)
plt.grid(True)
plt.show()
Notes:
• Dataset Loading: The script utilizes the load_dataset function from seaborn to load the Iris dataset directly, eliminating the need for external file handling.
• Statistical Computations: The code calculates essential descriptive statistics for the selected numerical column (sepal_length).
• Visualizations: Histograms and boxplots are generated to visualize the distribution of the numerical data, while bar and pie charts are used for the categorical variable (species).
• Outlier Detection: The Interquartile Range (IQR) method is employed to identify potential outliers in the numerical data, as sketched below.
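A minimal sketch of the IQR outlier check and the categorical charts described above, assuming iris, data, and numerical_col as defined in the code section:

# Detect outliers using the IQR method
Q1, Q3 = data.quantile(0.25), data.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
outliers = data[(data < lower) | (data > upper)]
print(f"Outliers outside [{lower:.2f}, {upper:.2f}]:\n{outliers}")

# Generate a bar graph and pie chart for the categorical column (species)
species_counts = iris['species'].value_counts()
species_counts.plot(kind='bar', color='coral', title='Species Counts')
plt.show()
species_counts.plot(kind='pie', autopct='%1.1f%%', title='Species Proportions')
plt.ylabel('')
plt.show()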
Output:
Conclusion:
This script provides a comprehensive approach to performing descriptive
statistical analysis and visualization on a dataset, facilitating a deeper
understanding of the data’s characteristics.
Experiment 2
Develop a program to perform bivariate analysis, compute Pearson's correlation coefficient, and generate a heat map using the Titanic dataset.
________________________________________________________
Dataset:
The Titanic dataset is readily available through the seaborn library, which
provides a convenient interface for loading and working with the data.
Code:
# Import necessary libraries
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Note:
• Loads the Titanic dataset using sns.load_dataset("titanic").
• Filters two numerical columns (age and fare) and drops missing values.
• Draws a scatter plot showing the relationship between age and fare.
• Computes the Pearson correlation coefficient using .corr().
• Calculates the covariance and correlation matrices.
• Uses sns.heatmap() to visualize the correlations, as sketched below.
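A minimal sketch of these steps (figure sizes and styling are illustrative):

titanic = sns.load_dataset("titanic")
df = titanic[["age", "fare"]].dropna()

# Scatter plot of age vs. fare
plt.figure(figsize=(8, 5))
sns.scatterplot(x="age", y="fare", data=df)
plt.title("Age vs. Fare")
plt.show()

# Pearson correlation coefficient between the two columns
print("Pearson correlation:", df["age"].corr(df["fare"]))

# Covariance and correlation matrices
print("Covariance matrix:\n", df.cov())
corr_matrix = df.corr()
print("Correlation matrix:\n", corr_matrix)

# Heatmap of the correlation matrix
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()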
Output:
Conclusion:
This experiment highlights the importance of statistical correlation analysis in
understanding variable relationships within a dataset. While age and fare in the
Titanic dataset do not show a strong correlation, similar techniques can be
applied to explore relationships in other datasets.
Experiment 3
Develop a program to implement Principal Component Analysis
(PCA) for reducing the dimensionality of the Iris dataset from 4
features to 2.
________________________________________________________________
Dataset:
The Iris dataset is readily available through the seaborn library, which provides
a convenient interface for loading and working with the data.
Code:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
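A minimal sketch of the PCA pipeline, assuming the seaborn copy of the Iris data (plot styling is illustrative):

# Load the Iris dataset and separate out the species labels
iris = sns.load_dataset('iris')
X = iris.drop(columns='species')

# Standardize features so each contributes equally to the components
X_scaled = StandardScaler().fit_transform(X)

# Reduce from 4 features to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Scatter plot of the two components, coloured by species
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=iris['species'])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA of the Iris Dataset')
plt.show()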
Output:
Conclusion:
1. Dimensionality Reduction Success:
◦ We successfully reduced the 4-dimensional Iris dataset to 2
dimensions using PCA while retaining most of the information.
2. Variance Explained:
◦ The first two principal components together explain most of the
variance in the dataset.
◦ The explained variance ratio tells us how much information is
preserved after transformation.
3. Visualization of Clusters:
◦ The scatter plot shows that different species of flowers in the Iris
dataset form distinct clusters, confirming that PCA preserves class
separability.
Experiment 5
Develop a program to implement the k-nearest neighbours algorithm (k-NN) with k = 1, 3, 5.
________________________________________________________
Dataset:
The Iris dataset is readily available through the seaborn library, which provides
a convenient interface for loading and working with the data.
Code:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score
# Load the Iris dataset and separate features from the target
iris = sns.load_dataset('iris')
X = iris.drop(columns='species').values
y = iris['species'].values

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def evaluate_knn(k, X_train, X_test, y_train, y_test):
    # Train a k-NN classifier for the given k
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # Make predictions
    y_pred = knn.predict(X_test)
    # Evaluate performance
    results = {'accuracy': accuracy_score(y_test, y_pred),
               'f1': f1_score(y_test, y_pred, average='weighted')}
    return results
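An illustrative driver for the helper above; note that scikit-learn's built-in weights='distance' option weights neighbours by 1/d, so the 1/d² weighting discussed in the conclusion would need a custom callable:

# Evaluate plain k-NN for k = 1, 3, 5
for k in [1, 3, 5]:
    res = evaluate_knn(k, X_train, X_test, y_train, y_test)
    print(f"k={k}: accuracy={res['accuracy']:.4f}, f1={res['f1']:.4f}")

# Distance-weighted k-NN gives closer neighbours more influence
knn_weighted = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_weighted.fit(X_train, y_train)
print("Weighted k=5 accuracy:", accuracy_score(y_test, knn_weighted.predict(X_test)))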
Output:
Conclusion:
1. Effect of k on Performance:
◦ For k=1, accuracy is generally high but may overfit.
◦ For k=3 and k=5, accuracy is more stable, reducing overfitting.
◦ Increasing k too much oversmooths the decision boundary and can cause underfitting.
2. Weighted vs. Regular k-NN:
◦ Weighted k-NN (distance-based weighting) often performs
better than regular k-NN, especially for datasets with uneven class
distributions.
◦ The weighting factor (1/d²) gives closer neighbours more
importance, improving classification when noise is present.
k-NN is a simple but effective algorithm for classification. The choice of k and
weighting impacts accuracy significantly. Weighted k-NN is recommended
when different classes have overlapping regions.
Experiment 6
Develop a program to implement the non-parametric Locally Weighted Linear Regression algorithm.
________________________________________________________
Dataset:
The dataset in this program is synthetic and nonlinear. The input feature X consists of 100 evenly spaced points between -3 and 3, and the target y follows a sinusoidal function with added Gaussian noise, making it ideal for testing Locally Weighted Regression (LWR) on a nonlinear pattern.
Code:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
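A minimal sketch of LWR on the data described above. Each query point gets its own weighted least-squares fit with Gaussian kernel weights w_i = exp(-(x_q - x_i)^2 / (2*tau^2)); tau = 0.5 and the noise level are illustrative:

def locally_weighted_regression(x_query, X, y, tau):
    # Gaussian kernel weights centred on the query point
    w = np.exp(-((X - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    # Design matrix with a bias column so each local model has an intercept
    Xb = np.c_[np.ones(len(X)), X]
    xq = np.array([1.0, x_query])
    # Closed-form weighted least squares: theta = (Xb^T W Xb)^-1 Xb^T W y
    theta = np.linalg.pinv(Xb.T @ W @ Xb) @ (Xb.T @ W @ y)
    return xq @ theta

# Synthetic data: 100 evenly spaced points in [-3, 3], sinusoid plus Gaussian noise
X = np.linspace(-3, 3, 100)
y = np.sin(X) + np.random.normal(0, 0.2, size=100)
y_pred = np.array([locally_weighted_regression(xq, X, y, tau=0.5) for xq in X])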
# Plot the noisy data and the LWR fit
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.5, label='Noisy data')
plt.plot(X, y_pred, color='red', label='LWR fit (tau = 0.5)')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
Output:
Conclusion
1. Impact of τ (Bandwidth Parameter):
◦ Small τ (0.1) → Model is highly flexible (overfits, follows noise).
◦ Medium τ (0.5) → Balances flexibility and smoothness.
◦ Large τ (1.0) → Model becomes nearly linear (underfits).
2. Strengths of LWR:
◦ Works well for nonlinear datasets.
◦ No fixed global parameters—each prediction adapts to local
patterns.
3. Limitations:
◦ Computationally expensive (O(n³) per query point due to matrix
inversion).
◦ Not ideal for high-dimensional datasets due to sparsity issues.
Experiment 7
Develop a program to demonstrate Linear Regression using the California Housing dataset and Polynomial Regression using the Auto MPG dataset.
________________________________________________________
Datasets:
1. California Housing Dataset (Used for Linear Regression):
◦ This dataset contains information about housing prices in California, with features such as the average number of rooms per dwelling (AveRooms) and the median home value in different districts.
◦ We use Linear Regression to predict house prices based on the
average number of rooms per dwelling.
2. Auto MPG Dataset (Used for Polynomial Regression):
◦ This dataset provides fuel efficiency (miles per gallon) for cars
based on features like vehicle weight, horsepower, and
acceleration.
◦ We use Polynomial Regression (degree=2) to predict MPG (fuel
efficiency) based on vehicle weight, as the relationship is
nonlinear.
Code:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
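A plausible loading step for the two datasets, kept consistent with the variable names used below (fetch_california_housing is scikit-learn's California Housing loader, and seaborn bundles a copy of Auto MPG as "mpg"):

from sklearn.datasets import fetch_california_housing

# California Housing: predict median house value from average rooms per dwelling
housing = fetch_california_housing(as_frame=True)
X_boston = housing.data[["AveRooms"]]  # variable names retained from the code below
y_boston = housing.target

# Auto MPG: seaborn's bundled copy, used for the polynomial regression part
df = sns.load_dataset("mpg")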
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_boston, y_boston, test_size=0.2, random_state=42)

# Fit the linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predict and evaluate
y_pred = lin_reg.predict(X_test)
print("Linear Regression R^2:", r2_score(y_test, y_pred))
# Data Preprocessing
df = df.dropna()  # Remove rows with missing values
df["horsepower"] = df["horsepower"].astype(float)  # Convert horsepower to float

# Define the feature (vehicle weight) and target (fuel efficiency)
X_auto = df[["weight"]]
y_auto = df["mpg"]

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_auto, y_auto, test_size=0.2, random_state=42)

# Transform the feature to degree-2 polynomial terms and fit
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)

# Predict and evaluate
y_pred_poly = poly_reg.predict(X_test_poly)
print("Polynomial Regression R^2:", r2_score(y_test, y_pred_poly))
Output:
Conclusion
1. Linear Regression Results:
◦ The California Housing Dataset shows a strong linear
correlation between house prices and the number of rooms per
dwelling.
◦ The R² score suggests how well the model fits the data.
2. Polynomial Regression Results:
◦ The Auto MPG Dataset exhibits a nonlinear relationship between
vehicle weight and fuel efficiency.
◦ Polynomial Regression (degree=2) performs better than Linear
Regression for this dataset.
Experiment 8
Develop a program to load the Titanic dataset. Split the data into
training and test sets. Train a decision tree classifier. Visualize the
tree structure. Evaluate accuracy, precision, recall, and F1-score.
________________________________________________________
Dataset:
The Titanic dataset is widely available; this experiment uses the Kaggle-style CSV, whose columns match the preprocessing code below (a copy is also bundled with the seaborn library). The dataset contains information about passengers on the Titanic, including features like Passenger Class, Sex, Age, Fare, and Embarked Location. The target variable is Survived (1 for survived, 0 for not survived).
Code:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
# Load the Titanic dataset (the column names below match the Kaggle Titanic CSV; the path is illustrative)
titanic = pd.read_csv("titanic.csv")

# Data Preprocessing
titanic.drop(["Name", "Ticket", "Cabin", "PassengerId"], axis=1, inplace=True)
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
titanic["Embarked"] = titanic["Embarked"].fillna(titanic["Embarked"].mode()[0])
# Predictions
y_pred = best_tree.predict(X_test)
# Evaluation Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Note:
In the above program, we have used Grid Search for hyperparameter tuning to find the best model. The best hyperparameters are selected based on the highest accuracy score during cross-validation. The model is then retrained using the optimal hyperparameters to improve performance.
Output:
Conclusion:
1. The Decision Tree model provides an interpretable classification structure
for predicting passenger survival based on their features.
2. Grid Search helps in identifying the best combination of hyperparameters
to avoid underfitting or overfitting.
3. The accuracy, precision, recall, and F1-score give insights into how well
the model performs on the testing data. The optimized Decision Tree model
performs better with increased accuracy and F1-score.
4. The visualization shows how the decision tree makes decisions at each node,
making it easy to understand the model's logic.
5. The final model is more reliable and produces better classification results
on the Titanic dataset.
Experiment 9
Develop a program to implement the Naive Bayesian classifier, using the Iris dataset for training. Compute the accuracy of the classifier on the test data.
________________________________________________________
Dataset:
The Iris dataset is readily available through the seaborn library, which provides
a convenient interface for loading and working with the data.
Code:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset and convert to DataFrame
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target
# Define features and target
X = iris_df[iris.feature_names]
y = iris_df['species']

# Split dataset into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the Gaussian Naive Bayes classifier and predict on the test set
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Naive Bayesian Classifier: {accuracy:.4f}")
Output:
Conclusion:
The Naive Bayesian Classifier demonstrates good performance on the Iris
dataset due to its simplicity and assumption of feature independence. It is
particularly effective for small datasets and provides competitive accuracy for
this classification task.
Experiment 10
Develop a program to implement k-means clustering using
Wisconsin Breast Cancer data set and visualize the clustering
result.
________________________________________________________
Dataset:
The Wisconsin Breast Cancer dataset contains information on breast cancer
cases. It consists of 30 numerical features computed from digitized images of
breast mass, such as the mean, standard error, and worst (largest) measurements
for characteristics like radius, texture, perimeter, area, and smoothness. The
target variable 'diagnosis' indicates whether the cancer is malignant or benign.
Code:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Wisconsin Breast Cancer dataset from the UCI Machine Learning Repository
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
columns = ['id', 'diagnosis'] + [f'feature_{i}' for i in range(1, 31)]
df = pd.read_csv(url, header=None, names=columns)
# Dataset Description
print("Dataset Description:\n")
print(df.describe())
# Preprocessing
df.drop(['id'], axis=1, inplace=True)
df['diagnosis'] = LabelEncoder().fit_transform(df['diagnosis'])
# Feature Scaling (cluster on the 30 numerical features)
X = df.drop('diagnosis', axis=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
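A sketch of the clustering, evaluation, and visualization steps (two clusters, matching the benign/malignant structure; plot choices are illustrative):

# Fit K-Means with two clusters
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

# Evaluate the clustering
print(f"Silhouette Score: {silhouette_score(X_scaled, clusters):.4f}")
print(f"Inertia: {kmeans.inertia_:.2f}")

# Visualize the clusters on the first two scaled features
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], hue=clusters, palette='Set1')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='black', marker='X', s=200, label='Centroids')
plt.title('K-Means Clustering of the Wisconsin Breast Cancer Data')
plt.legend()
plt.show()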
Output:
Conclusion:
The K-Means Clustering algorithm was applied to the Wisconsin Breast
Cancer dataset to group the data into two clusters. The clustering result provides
an unsupervised approach to categorize the data points based on feature
similarity. The Silhouette Score and Inertia metrics were used to evaluate the
clustering performance, indicating that the algorithm effectively separated the
data into meaningful clusters.