ML Assignment
Task 01
Import a classification dataset (e.g., Iris, Titanic, or Breast Cancer). Perform an exploratory data
analysis (EDA) to understand the features, target classes, and data distribution. Visualize key relationships
using scatter plots, bar charts, or pair plots.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic dataset (file name is an assumption)
titanic_data = pd.read_csv('titanic.csv')

# Summary statistics
print("\n\nSummary statistics of the dataset")
print(titanic_data.describe())

# Survival distribution
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
sns.countplot(data=titanic_data, x='Survived', ax=ax[0])
ax[0].set_title('Survival Distribution')
sns.countplot(data=titanic_data, x='Survived', hue='Sex', ax=ax[1])  # e.g., survival by sex
ax[1].set_title('Survival by Sex')
plt.tight_layout()
plt.show()
Task 02:
Handling Missing Values in a Classification Dataset: Use a dataset with missing
values (e.g., Titanic dataset). Demonstrate different methods to handle them, such
as imputation (mean, mode) or deletion. Compare the impact of preprocessing on
model performance.
from sklearn.impute import SimpleImputer

# Impute missing 'Embarked' with mode (most frequent)
imputer = SimpleImputer(strategy='most_frequent')
titanic_data['Embarked'] = imputer.fit_transform(titanic_data[['Embarked']]).ravel()
print('\nMissing Values After Handling:')
print(titanic_data.isnull().sum())
Task 03:
Scaling and Normalization: Apply standardization (z-score) or
normalization (Min-Max scaling) to numeric features of a dataset (e.g., Wine
dataset). Train a classification model before and after scaling, and evaluate the
effect on accuracy.
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Numeric feature columns
numerical_cols = titanic_data.select_dtypes(include=['float64', 'int64']).columns

# Standardization (Z-score)
scaler = StandardScaler()
titanic_data_standardized = titanic_data.copy()
titanic_data_standardized[numerical_cols] = scaler.fit_transform(titanic_data[numerical_cols])

# Display examples
print("Standardized Data (First 5 Rows):")
print(titanic_data_standardized[numerical_cols].head())

# Min-Max normalization
minmax_scaler = MinMaxScaler()
titanic_data_normalized = titanic_data.copy()
titanic_data_normalized[numerical_cols] = minmax_scaler.fit_transform(titanic_data[numerical_cols])
print("Normalized Data (First 5 Rows):")
print(titanic_data_normalized[numerical_cols].head())
Task 04:
Handling Imbalanced Datasets: Import an imbalanced dataset (e.g., Credit Card Fraud
Detection). Apply techniques such as oversampling (SMOTE) or undersampling. Train a
classifier and evaluate its performance using metrics like F1-score; a sketch follows below.
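A minimal SMOTE sketch using the imbalanced-learn package; X_train/X_test and the binary labels are assumed to come from a prior split of the fraud data:

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Oversample the minority class in the training split only
# (assumes numeric, NaN-free features from the preprocessing above)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Train on the balanced data and evaluate F1 on the untouched test set
clf = RandomForestClassifier(random_state=42)
clf.fit(X_resampled, y_resampled)
print("F1-score:", f1_score(y_test, clf.predict(X_test)))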
Task 05:
Encoding Categorical Variables: Import a dataset with categorical features
(e.g., Titanic or Heart Disease). Apply label encoding and one-hot encoding,
then train a classifier to compare the effect of encoding techniques on
performance.
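No code survived for this task; a minimal sketch, assuming `titanic_data` from Task 02 with 'Sex' and 'Embarked' as the categorical columns:

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

cat_cols = ['Sex', 'Embarked']  # assumed categorical columns

# Variant 1: label encoding
df_label = titanic_data.copy()
for col in cat_cols:
    df_label[col] = LabelEncoder().fit_transform(df_label[col].astype(str))

# Variant 2: one-hot encoding
df_onehot = pd.get_dummies(titanic_data, columns=cat_cols, dtype=int)

# Train the same classifier on each variant and compare accuracy
for name, df in [('label', df_label), ('one-hot', df_onehot)]:
    X = df.drop(columns=['Survived']).select_dtypes(include='number')
    X = X.fillna(X.median())  # simple guard for any remaining NaNs
    y = df['Survived']
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    print(f"Accuracy with {name} encoding:", accuracy_score(y_te, clf.predict(X_te)))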
Task 06:
Feature Transformation: Apply feature transformation techniques like PCA to a
classification dataset. Train a classifier and compare the performance with and
without feature transformation.
# Work on a copy of the data and collect its numeric columns
titanic_data_temp = titanic_data.copy()
numerical_features_temp = titanic_data_temp.select_dtypes(include=['float64', 'int64']).columns

# Exclude the target 'Survived' from the features used for PCA
features_for_pca = numerical_features_temp.drop('Survived')
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Fit PCA on the selected numeric features (assumes missing values were handled above)
pca = PCA(n_components=2)
pca_components = pca.fit_transform(titanic_data_temp[features_for_pca])

# Assign PCA components to new columns
titanic_data_pca_temp = titanic_data_temp.copy()
pca_columns = [f'PC{i+1}' for i in range(pca.n_components_)]
titanic_data_pca_temp[pca_columns] = pca_components

# Train one classifier on the original features and one on the PCA features
# (X_train/X_test and X_train_pca/X_test_pca come from train_test_split on each feature set)
clf_original = RandomForestClassifier(random_state=42)
clf_original.fit(X_train, y_train)
y_pred_original = clf_original.predict(X_test)
clf_pca = RandomForestClassifier(random_state=42)
clf_pca.fit(X_train_pca, y_train_pca)
y_pred_pca = clf_pca.predict(X_test_pca)
Task 07:
Train a Binary Classification Model: Use a dataset like Titanic or Heart
Disease to train a binary classifier (e.g., Logistic Regression or Random
Forest). Evaluate the model using accuracy, precision, recall, and F1-score.
# X and y hold the preprocessed Titanic features and the 'Survived' target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest and predict on the test set
rf_model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = rf_model.predict(X_test)

# Evaluation
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
Task 08:
Train a Multi-Class Classification Model: Use a dataset like Iris or MNIST
to train a multi-class classification model (e.g., Decision Tree or SVM).
Evaluate the model using metrics like accuracy and per-class F1-scores.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Split data
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

# Train a Decision Tree and predict
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train_iris, y_train_iris)
y_pred_iris = dt_model.predict(X_test_iris)

# Evaluation: accuracy and per-class F1-scores
print("Decision Tree Accuracy on Iris:", accuracy_score(y_test_iris, y_pred_iris))
print(classification_report(y_test_iris, y_pred_iris, target_names=iris.target_names))
Task 09:
Evaluation Metrics and Cross-Validation:
a. Use a dataset to train a classifier and evaluate it with k-fold cross-validation.
Report metrics such as:
i. Confusion Matrix
ii. Accuracy, Precision, Recall, F1-Score
iii. ROC-AUC Score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

# Classification report (precision, recall, F1 per class)
print("Classification Report:")
print(classification_report(y_test, y_pred))

# 5-fold cross-validation
cv_scores = cross_val_score(rf_model, X, y, cv=5)
print("Cross-Validation Scores:", cv_scores)
print("Mean CV Score:", cv_scores.mean())
Task 10:
Comparing Models: Train multiple classifiers (e.g., Logistic Regression, Decision
Tree, k-NN, SVM, Naïve Bayes and Random Forest) on the same dataset and
compare their evaluation metrics.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

models = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Logistic Regression': LogisticRegression(),
    'SVM': SVC(),
    'Naive Bayes': GaussianNB(),
    'K-Nearest Neighbors': KNeighborsClassifier()
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = accuracy_score(y_test, y_pred)
print("Model Comparison:")
print(results)
print("Model Comparison:")
print(results)
Model Comparison:
{'Random Forest': 0.8435754189944135,
 'Logistic Regression': 0.770949720670391,
 'SVM': 0.6815642458100558,
 'Naive Bayes': 0.7988826815642458,
 'K-Nearest Neighbors': 0.6703910614525139}
Assignment-02
Assignment Name: Problem solving on Clustering
Task 01:
Dataset Import and Initial Analysis
a. Load a clustering dataset (e.g., Iris, Mall Customers, or Wine dataset). Perform
an exploratory data analysis (EDA) to understand the features and the
distribution of data.
b. Visualize the data distribution using pair plots, histograms, and scatter plots.
c. Investigate any potential outliers or missing values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
iris = pd.read_csv('iris.csv')

# Scatter plot of sepal dimensions, coloured by class
sns.scatterplot(x='sepal.length', y='sepal.width', hue='variety', data=iris)
plt.title('Sepal Length vs Sepal Width')
plt.show()
# Check for outliers using boxplots
for column in iris.select_dtypes(include=np.number).columns:
    sns.boxplot(x=iris[column])
    plt.title(f"Boxplot of {column}")
    plt.show()
Task 02:
Handling Missing Values: Import a dataset with missing values. Demonstrate how
to handle missing data using imputation methods (e.g., mean, median, or mode
imputation) and evaluate how this affects the clustering results.
iris.isnull().sum()
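The Iris CSV itself has no missing values, so a minimal sketch that injects some artificially, mean-imputes them, and compares the resulting K-means clusterings with the Adjusted Rand Index:

from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Inject artificial missing values into a copy of the feature matrix
rng = np.random.default_rng(42)
X_full = iris.iloc[:, :4].to_numpy()
X_missing = X_full.copy()
mask = rng.random(X_missing.shape) < 0.05
X_missing[mask] = np.nan

# Mean-impute and cluster both versions
X_imputed = SimpleImputer(strategy='mean').fit_transform(X_missing)
labels_full = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_full)
labels_imputed = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_imputed)
print("ARI between clusterings:", adjusted_rand_score(labels_full, labels_imputed))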
Task 03:
Scaling and Normalization: Import a dataset with features that have different scales (e.g., Wine
dataset). Perform feature scaling using techniques like Standardization (z-score) or Min-Max
Normalization. Evaluate how this affects the results of clustering algorithms like k-means.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.cluster import KMeans

# Standardization (z-score)
scaler_standard = StandardScaler()
iris_standard = iris.copy()
iris_standard.iloc[:, :-1] = scaler_standard.fit_transform(iris_standard.iloc[:, :-1])

# Min-Max Normalization
scaler_minmax = MinMaxScaler()
iris_minmax = iris.copy()
iris_minmax.iloc[:, :-1] = scaler_minmax.fit_transform(iris_minmax.iloc[:, :-1])

# Cluster the normalized data
kmeans_minmax = KMeans(n_clusters=4, random_state=42, n_init=10)
iris_minmax['cluster'] = kmeans_minmax.fit_predict(iris_minmax.iloc[:, :4])

# Visualize the clusters on the sepal dimensions
sns.scatterplot(x='sepal.length', y='sepal.width', hue='cluster', data=iris_minmax)
plt.show()
Task 04:
Dealing with Categorical Data: Use a dataset with categorical features (e.g., Mall
Customers dataset). Perform encoding techniques (e.g., one-hot encoding) to convert
categorical variables into numerical values. Train a clustering model and compare the
impact of these transformations on clustering results.
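No code survives for this task; a minimal sketch for the Mall Customers dataset, where the file name and the 'Gender', income, and spending-score column names are assumptions:

from sklearn.cluster import KMeans

# Load Mall Customers data (file and column names are assumptions)
customers = pd.read_csv('Mall_Customers.csv')

# One-hot encode the categorical 'Gender' column
customers_encoded = pd.get_dummies(customers, columns=['Gender'], dtype=int)

# Cluster on the encoded and numeric features
features = customers_encoded[['Annual Income (k$)', 'Spending Score (1-100)', 'Gender_Male']]
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
customers_encoded['cluster'] = kmeans.fit_predict(features)
print(customers_encoded['cluster'].value_counts())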
Task 05:
Feature Selection and Dimensionality Reduction: Apply Principal Component
Analysis (PCA) to reduce dimensionality in a high-dimensional clustering dataset
(e.g., Wine dataset). Visualize the clustering results in 2D/3D using the reduced
features and compare the performance before and after dimensionality reduction.
from sklearn.decomposition import PCA

# Apply PCA to reduce dimensionality to 2 components
pca = PCA(n_components=2)
iris_pca_transformed = pca.fit_transform(iris.iloc[:, :4])
iris_pca = pd.DataFrame(iris_pca_transformed, columns=['PC1', 'PC2'])
iris_pca['variety'] = iris['variety']

# Visualize the data in the reduced 2-D space
sns.scatterplot(x='PC1', y='PC2', hue='variety', data=iris_pca)
plt.title('Iris Projected onto the First Two Principal Components')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
Task 06:
K-means Clustering:
a. Apply K-means clustering on a dataset (e.g., Iris or Mall Customers).
b. Use the Elbow Method to determine the optimal number of clusters.
c. Visualize the resulting clusters and analyze the cluster centroids.
# Compute WCSS (within-cluster sum of squares) for k = 1 to 10
wcss = []
for k in range(1, 11):
    wcss.append(KMeans(n_clusters=k, random_state=42, n_init=10).fit(iris.iloc[:, :4]).inertia_)

plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

# From the Elbow Method plot, choose the optimal number of clusters (e.g., 3)
optimal_clusters = 3
kmeans_optimal = KMeans(n_clusters=optimal_clusters, random_state=42, n_init=10)
iris['cluster'] = kmeans_optimal.fit_predict(iris.iloc[:, :4])
Task 07:
Hierarchical Clustering:
a. Apply Agglomerative Hierarchical Clustering to a dataset (e.g., Wine or
Iris).
b. Use a Dendrogram to visualize the hierarchy and determine the appropriate
number of clusters.
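Part (b) asks for a dendrogram; a minimal sketch using SciPy's hierarchy utilities, assuming the four numeric feature columns come first in `iris`:

from scipy.cluster.hierarchy import dendrogram, linkage

# Ward linkage on the four numeric feature columns
linked = linkage(iris.iloc[:, :4], method='ward')
dendrogram(linked)
plt.title('Dendrogram of the Iris Dataset')
plt.show()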
from sklearn.cluster import AgglomerativeClustering

# Determine the appropriate number of clusters from the dendrogram and apply Agglomerative Clustering
n_clusters = 3  # adjust based on the dendrogram
agg_clustering = AgglomerativeClustering(n_clusters=n_clusters)
iris['agg_cluster'] = agg_clustering.fit_predict(iris.iloc[:, :4])  # numeric features only
Task 08:
Evaluation of Clustering Results: After applying clustering algorithms (e.g., K-means
or Hierarchical Clustering), evaluate the clustering results using metrics like:
a. Silhouette Score: To measure how similar points within a cluster are,
compared to points in other clusters.
b. Inertia (within-cluster sum of squares): To assess how compact the clusters
are.
c. Adjusted Rand Index (ARI): To evaluate the similarity between the clusters
and the true labels (if available).
d. Davies-Bouldin Index: To evaluate the average similarity ratio of each cluster
with other clusters.
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Use only the four numeric feature columns for the metrics
X_features = iris.iloc[:, :4]

# Silhouette Score
silhouette_kmeans = silhouette_score(X_features, iris['cluster'])
silhouette_agg = silhouette_score(X_features, iris['agg_cluster'])

# Davies-Bouldin Index
dbi_kmeans = davies_bouldin_score(X_features, iris['cluster'])
dbi_agg = davies_bouldin_score(X_features, iris['agg_cluster'])
# Inertia (within-cluster sum of squares) for KMeans
inertia_kmeans = kmeans_optimal.inertia_
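The Adjusted Rand Index from the task list can be computed against the true class labels; a minimal sketch using the `variety` column as ground truth:

from sklearn.metrics import adjusted_rand_score

# ARI compares predicted cluster labels against the true class labels
ari_kmeans = adjusted_rand_score(iris['variety'], iris['cluster'])
ari_agg = adjusted_rand_score(iris['variety'], iris['agg_cluster'])
print("ARI (K-means):", ari_kmeans)
print("ARI (Agglomerative):", ari_agg)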
Assignment-03
Assignment Name: Problem solving on Regression
Task 01
Data Loading:
• Import a dataset of your choice (e.g., California Housing, Boston Housing, or a custom CSV file).
• Identify the target and predictor variables.
• Check for missing values, data types, and outliers.
import pandas as pd
import numpy as np

# Load the dataset
data = pd.read_csv('housing.csv')

# Inspect data types
print(data.dtypes)
Task 02
Handling Missing Values: Import a dataset with missing values. Demonstrate methods to
handle them (e.g., mean/mode/median imputation, dropping rows/columns) and analyze
how this affects the regression model’s performance.
# Check for missing values
missing_values = data.isnull().sum()
print("Missing values before handling:\n", missing_values)
Task 03
Scaling and Normalization: Import a dataset with features on different scales. Perform
normalization (e.g., Min-Max Scaling) and standardization (e.g., z-score). Compare the
impact of these preprocessing steps on the performance of a linear regression model.
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Numeric feature columns
numeric_cols = data.select_dtypes(include=np.number).columns

# Standard Scaling
standard_scaler = StandardScaler()
data_standard_scaled = data.copy()
data_standard_scaled[numeric_cols] = standard_scaler.fit_transform(data[numeric_cols])
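To compare the two scalings as the task asks, a minimal sketch that fits a linear regression on each preprocessed copy; `median_house_value` as the target is an assumption from the California Housing layout:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Min-Max Scaling
minmax_scaler = MinMaxScaler()
data_minmax_scaled = data.copy()
data_minmax_scaled[numeric_cols] = minmax_scaler.fit_transform(data[numeric_cols])

# Fit the same linear model on each preprocessed version and compare R²
for name, df in [('standardized', data_standard_scaled), ('min-max', data_minmax_scaled)]:
    X = df.drop(columns=['median_house_value']).select_dtypes(include=np.number)
    X = X.fillna(X.median())  # simple guard for any remaining NaNs
    y = df['median_house_value']
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_tr, y_tr)
    print(f"R² with {name} features:", r2_score(y_te, model.predict(X_te)))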
Task 04
Categorical Data Encoding: Use a dataset with categorical features. Demonstrate how to
apply label encoding and one hot encoding. Train a regression model and compare results
using these encoding techniques.
from sklearn.preprocessing import LabelEncoder

# Categorical feature columns
categorical_cols = data.select_dtypes(include='object').columns

# Label Encoding
data_label_encoded = data.copy()
for col in categorical_cols:
    le = LabelEncoder()
    data_label_encoded[col] = le.fit_transform(data[col])

# One-Hot Encoding
data_one_hot_encoded = pd.get_dummies(data, columns=categorical_cols)
Task 05
Feature Selection using LASSO Regression: Use a given dataset to perform regression using
LASSO (Least Absolute Shrinkage and Selection Operator). Identify the features selected by
LASSO, train the regression model, and evaluate its performance using metrics like R², RMSE,
and MAE.
from sklearn.linear_model import Lasso

# Apply LASSO (X_train, y_train come from a standard train/test split of the data)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# Features selected (non-zero coefficients)
selected_features = X_train.columns[lasso.coef_ != 0]
print("Features selected by LASSO:\n", selected_features)
Task 06
Correlation-Based Feature Selection: Perform feature selection based on correlation
analysis. Remove highly correlated features (e.g., correlation > 0.85) and train a regression
model on the remaining features. Evaluate the model using R², RMSE, MAE, and compare its
performance with a model trained on the original dataset.
# Compute the absolute correlation matrix of the numeric features
correlation_matrix = data.select_dtypes(include=np.number).corr().abs()

# Drop one feature from each pair with correlation > 0.85 (standard upper-triangle scan)
upper = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.85)]
data_reduced = data.drop(columns=to_drop)
print("Dropped highly correlated features:", to_drop)
Task 07
Regularization Techniques (Ridge and Lasso): Train Ridge and Lasso regression models on a
dataset with multicollinearity (e.g., the diabetes dataset). Compare the regularization effects by
varying the hyperparameter (α, sometimes written λ) and observing its impact on model
performance and feature coefficients.
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Sweep the regularization strength, recording coefficients and test errors
alphas = np.logspace(-3, 2, 50)
ridge_coefs, lasso_coefs, ridge_errors, lasso_errors = [], [], [], []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    lasso = Lasso(alpha=alpha).fit(X_train, y_train)
    ridge_coefs.append(ridge.coef_)
    lasso_coefs.append(lasso.coef_)
    ridge_errors.append(mean_squared_error(y_test, ridge.predict(X_test)))
    lasso_errors.append(mean_squared_error(y_test, lasso.predict(X_test)))
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(alphas, ridge_coefs)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Coefficients')
plt.title('Ridge Coefficients as a Function of Regularization')
plt.legend(diabetes.feature_names, loc='best')

plt.subplot(1, 2, 2)
plt.plot(alphas, lasso_coefs)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Coefficients')
plt.title('Lasso Coefficients as a Function of Regularization')
plt.legend(diabetes.feature_names, loc='best')
plt.tight_layout()
plt.show()
Task 08
Model Evaluation with Cross-Validation: Use k-fold cross-validation to train and evaluate a
regression model. Report metrics such as MSE, R², MAE, and RMSE for each fold. Compare
performance with and without cross-validation.
from sklearn.model_selection import KFold, cross_val_score

# Perform 5-fold cross-validation; sklearn's error metrics are sign-flipped scorers
lasso = Lasso(alpha=0.1)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
mse_scores = -cross_val_score(lasso, X, y, cv=kf, scoring='neg_mean_squared_error')
r2_scores = cross_val_score(lasso, X, y, cv=kf, scoring='r2')
mae_scores = -cross_val_score(lasso, X, y, cv=kf, scoring='neg_mean_absolute_error')
rmse_scores = np.sqrt(mse_scores)
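For the "without cross-validation" half of the comparison, a minimal sketch that scores the same model on the single hold-out split created earlier:

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Fit once on the training split and evaluate on the untouched test split
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)
print("Hold-out MSE:", mean_squared_error(y_test, y_pred))
print("Hold-out R²:", r2_score(y_test, y_pred))
print("Hold-out MAE:", mean_absolute_error(y_test, y_pred))
print("Mean CV MSE:", mse_scores.mean())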
Task 09
Regression Model Comparison: Train multiple regression models (e.g., Linear Regression,
Ridge, and Lasso) on the same dataset. Compare them based on evaluation metrics such as
MAE, MSE, RMSE, and R². Discuss the scenarios where each model performs better.
from sklearn.linear_model import LinearRegression

# Train the three models on the same split
linear_reg = LinearRegression().fit(X_train, y_train)
ridge_reg = Ridge(alpha=1.0).fit(X_train, y_train)
lasso_reg = Lasso(alpha=0.1).fit(X_train, y_train)

# Make predictions
y_pred_linear = linear_reg.predict(X_test)
y_pred_ridge = ridge_reg.predict(X_test)
y_pred_lasso = lasso_reg.predict(X_test)
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Collect metrics for the linear model: (MSE, MAE, RMSE, R²)
mse_linear = mean_squared_error(y_test, y_pred_linear)
metrics_linear = (mse_linear, mean_absolute_error(y_test, y_pred_linear),
                  np.sqrt(mse_linear), r2_score(y_test, y_pred_linear))

# Print the evaluation metrics
print("Linear Regression Performance:")
print(f"MSE: {metrics_linear[0]}, MAE: {metrics_linear[1]}, RMSE: {metrics_linear[2]}, R²: {metrics_linear[3]}")
Discussion of Scenarios:
- Linear Regression:
  - Performs better when the relationship between the independent and dependent variables is approximately linear and there is no multicollinearity in the data.
  - Suitable for datasets with a large number of features where feature selection is not critical.
- Ridge Regression:
  - Performs better in scenarios where multicollinearity is present among the features.
  - Useful when you want to retain all features but reduce their impact by shrinking the coefficients.
  - Suitable for datasets where overfitting is a concern and you want to regularize the model.
- Lasso Regression:
  - Performs better when feature selection is important, as it can shrink some coefficients to zero, effectively performing variable selection.
  - Suitable for datasets with many features, especially when some of them are not important.
  - Useful when you want a simpler model that is easier to interpret.

Each model has its strengths and is suitable for a different scenario depending on the nature of the data and the specific requirements of the analysis.