Practical Labs Guide


Practical Lab for Machine Learning
Day 1
Lab: Setting Up Python Environment for Machine Learning
Objective:
Understand how to create and configure a Python environment for machine learning
projects, including the installation of essential libraries such as NumPy, Pandas,
Matplotlib, Seaborn, Scikit-learn, TensorFlow, and Keras.
Step 1: Installing Python
1. Ensure that Python is installed on your system by running the following command:
python --version
If Python is not installed, download and install it from the official Python website.
Step 2: Creating a Virtual Environment
A virtual environment is a self-contained directory where Python and its libraries are
installed, which helps to manage dependencies for different projects.

1. Navigate to your project folder where you'd like to create the virtual environment:

cd /path/to/your/project
2. Create a virtual environment named ml_env:
● For Linux/macOS
python3 -m venv ml_env
● For Windows users
python -m venv ml_env
This command will create a folder named ml_env that contains the environment.
Step 3: Activating the Virtual Environment
1. To activate the virtual environment:
● For Linux/macOS
source ml_env/bin/activate
● For Windows users
ml_env\Scripts\activate
2. After activation, your terminal prompt should show the virtual environment name
(ml_env) indicating it’s activated.
Step 4: Installing Required Libraries
With the virtual environment activated, you can now install the necessary libraries for
machine learning.

1. Install the required libraries:

pip install numpy pandas matplotlib seaborn scikit-learn tensorflow keras

This command installs:

● numpy: Library for numerical operations.


● pandas: Library for data manipulation.
● matplotlib & seaborn: Libraries for data visualization.
● scikit-learn: Library for machine learning algorithms.
● tensorflow & keras: Libraries for deep learning.

Step 5: Verifying Installation


1. Open a Python interpreter by typing python in your terminal:

python

2. Import the installed libraries to ensure they work:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import tensorflow as tf
from tensorflow import keras
3. If there are no errors, your environment is set up successfully.
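As an optional extra check (not required by the lab), the following sketch prints the version of each installed library, which is handy when reporting issues:

# Optional: print the installed version of each library
import numpy as np
import pandas as pd
import matplotlib
import seaborn as sns
import sklearn
import tensorflow as tf

print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)
print("Matplotlib:", matplotlib.__version__)
print("Seaborn:", sns.__version__)
print("Scikit-learn:", sklearn.__version__)
print("TensorFlow:", tf.__version__)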
Step 6: Deactivating the Virtual Environment
Once you're done working in the environment, you can deactivate it by typing:

deactivate
Practical Exercise 1: Test Installation
Objective:
Generate and visualize some random data using NumPy, Pandas, and Matplotlib.
Step 1: Import Required Libraries
Before generating data or plotting, you'll need to import the necessary libraries: NumPy
(for random number generation), Pandas (for organizing data), and Matplotlib (for
plotting).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2: Create a Random Dataset


Use NumPy to create a dataset of 100 rows and 4 columns. Each entry will be a random
number drawn from a normal distribution.
# Generate random data using NumPy
data = np.random.randn(100, 4)

# Convert the NumPy array into a Pandas DataFrame


df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
Here, we have four columns (A, B, C, D), and each column contains 100 random
values.
Step 3: Plot the Data
We will plot the data using Pandas' built-in plotting, which leverages Matplotlib under
the hood.
# Plot the data
df.plot(kind='line')
# Display the plot
plt.show()

Step 4: Understand the Output


Once you run the code, a plot should appear showing four lines (one for each column)
across 100 data points, allowing you to visualize the random data generated.
Day 2
Practical Exercise 2: Simple Machine Learning Example Using
Scikit-learn
Step 1: Import Required Libraries
For this exercise, we'll use Scikit-learn to load a dataset, split it into training and test
sets, train a random forest classifier, and evaluate its accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

Step 2: Load the Iris Dataset


The Iris dataset is a classic dataset containing data on different species of iris flowers. It
includes 150 samples with 4 features each.
# Load the Iris dataset
iris = load_iris()
# Separate features and target variables
X = iris.data # Features (petal length, width, etc.)
y = iris.target # Labels (species)

Step 3: Split the Dataset into Training and Test Sets


We’ll split the dataset into two parts: 70% for training and 30% for testing. This helps to
evaluate the model on unseen data.
# Split the data into training and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

Step 4: Train a Random Forest Classifier


A random forest classifier is an ensemble method that fits multiple decision trees on
different parts of the dataset. Here, we’ll train it on the Iris dataset.
# Create and train a RandomForestClassifier
clf = RandomForestClassifier()
# Fit the model on the training data
clf.fit(X_train, y_train)

Step 5: Make Predictions on the Test Set


After training the model, we can use it to predict the species of iris flowers in the test
set.
# Predict labels for the test set
y_pred = clf.predict(X_test)

Step 6: Evaluate the Model’s Performance


To evaluate the model, we'll use the accuracy score, which tells us the proportion of
correct predictions.
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Step 7: Understand the Output


The output will be the accuracy score of the model, which represents the percentage of
correct predictions made on the test set.
Accuracy: 1.0
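Accuracy alone does not show which species are confused with one another. As an optional follow-up (not part of the original steps), the sketch below prints a confusion matrix and a per-class report using scikit-learn's standard metrics:

from sklearn.metrics import confusion_matrix, classification_report

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))

# Precision, recall, and F1-score for each species
print(classification_report(y_test, y_pred, target_names=iris.target_names))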
Summary of Steps
1. Exercise 1: Generate and Visualize Data
a. Import NumPy, Pandas, and Matplotlib.
b. Generate random data using NumPy.
c. Convert it to a Pandas DataFrame.
d. Plot the data using Matplotlib.
2. Exercise 2: Simple Machine Learning Example
a. Import Scikit-learn libraries for dataset, model, and metrics.
b. Load the Iris dataset.
c. Split the data into training and test sets.
d. Train a RandomForestClassifier.
e. Make predictions on the test set.
f. Evaluate the model using accuracy score.
Practical Exercise 3: Processing and Analyzing Air Quality Dataset
Objective:
This lab will walk through all the stages of data processing using your Dataset.csv for
air quality data.

Dataset Columns:

● NAME (Location)
● HUMIDITY, LIGHT, NO_MAX, NO_MIN, NO2_MAX, NO2_MIN, etc.
● SOUND, TEMPERATURE, UV, AIR_PRESSURE, Lattitude, Longitude
● LASTUPDATEDATETIME (Timestamp)

Data Collection Practical Exercise


Goal: Load the air quality dataset and inspect its structure.
import pandas as pd

# Load the dataset


df = pd.read_csv('Dataset.csv')

# Inspect the first few rows


print(df.head())

# Check the data types and non-null values


print(df.info())
Replace 'Dataset.csv' with the path to your dataset.
Practical Exercise 4: Data Cleaning
Objective:
Goal: Clean the data by handling missing values and ensuring correct data types.
Steps for Data Cleaning:
1. Handle missing values:
a. Identify missing values.
b. Drop or fill missing data if necessary.
2. Convert the timestamp column:
a. Convert LASTUPDATEDATETIME to the correct datetime format.

# Check for missing values


print(df.isnull().sum())

# Fill missing values with the mean for numeric columns only
numeric_cols = df.select_dtypes(include=['number']).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Convert 'LASTUPDATEDATETIME' to datetime format


df['LASTUPDATEDATETIME'] = pd.to_datetime(df['LASTUPDATEDATETIME'],
format='%d/%m/%y %H:%M')

# Confirm data cleaning


print(df.info())
Replace 'LASTUPDATEDATETIME' with relevant column names from your dataset.
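Step 1.b above mentions dropping rows as an alternative to filling them. A minimal sketch of that option (use it instead of the mean-fill, not in addition to it):

# Alternative: drop any rows that still contain missing values
df_dropped = df.dropna()
print(df_dropped.shape)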
Practical Exercise 5: Data Transformation
Objective:
Goal: Transform data by scaling numerical columns and encoding categorical
variables.
Steps for Data Transformation:
1. Scaling: Normalize numerical columns like HUMIDITY, LIGHT, etc.
2. Encoding: If necessary, encode categorical variables (e.g., NAME) using One-Hot
Encoding (see the encoding sketch after the scaling code below).

from sklearn.preprocessing import StandardScaler

# Select numeric columns to scale


numerical_columns = ['HUMIDITY', 'LIGHT', 'NO_MAX', 'NO_MIN',
                     'NO2_MAX', 'NO2_MIN', 'OZONE_MAX', 'OZONE_MIN',
                     'PM10_MAX', 'PM10_MIN', 'SOUND',
                     'TEMPRATURE_MAX', 'TEMPRATURE_MIN', 'UV_MAX', 'UV_MIN',
                     'AIR_PRESSURE']

# Scale the numerical columns


scaler = StandardScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

# Check the transformation


print(df.head())
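
The code above covers the scaling step only. For the encoding step, here is a minimal sketch using Pandas' get_dummies on the NAME column; the LOC prefix is just an illustrative choice:

# One-hot encode the categorical 'NAME' column
df_encoded = pd.get_dummies(df, columns=['NAME'], prefix='LOC')
print(df_encoded.shape)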

Practical Exercise 6: Exploratory Data Analysis (EDA)


Objective:
Goal: Perform EDA to understand the data structure and relationships between
variables.
Steps for EDA:
1. Descriptive statistics:
a. Calculate basic statistics for all numerical columns (mean, median, standard
deviation).
2. Correlation:
a. Analyze correlation between air pollutants and temperature, humidity, etc.

# Descriptive statistics
print(df.describe())

# Correlation between numerical columns


import numpy as np
numeric_df = df.select_dtypes(include=[np.number])
correlation_matrix = numeric_df.corr()
print(correlation_matrix)

Practical Exercise 7: Visualizing Datasets


Objective:
Goal: Visualize the dataset to identify trends and patterns.
Steps for Visualization:
1. Histogram: Visualize the distribution of key air pollutants like NO2_MAX, PM10_MAX,
etc.
2. Scatter plot: Explore relationships, such as between HUMIDITY and NO_MAX.
3. Heatmap: Visualize the correlation between air pollutants and weather conditions.

import matplotlib.pyplot as plt


import seaborn as sns

# Histogram for NO2_MAX


sns.histplot(df['NO2_MAX'], kde=True)
plt.title('Distribution of NO2_MAX')
plt.show()

# Scatter plot for HUMIDITY vs NO_MAX


plt.scatter(df['HUMIDITY'], df['NO_MAX'])
plt.xlabel('Humidity')
plt.ylabel('NO_MAX')
plt.title('Humidity vs NO_MAX')
plt.show()

# Heatmap for correlation


plt.figure(figsize=(10, 8))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

Practical Exercise 8: Descriptive Statistics


Objective:
Goal: Calculate and interpret the key statistical measures of the dataset.
Steps for Descriptive Statistics:
1. Mean: Average values of pollutants and weather data.
2. Median: Middle value to understand the data distribution.
3. Standard Deviation: Measure variability.

# Mean of key variables


mean_values = df[['NO_MAX', 'NO2_MAX', 'PM10_MAX', 'HUMIDITY',
'TEMPRATURE_MAX']].mean()
print("Mean Values:\n", mean_values)

# Median of key variables


median_values = df[['NO_MAX', 'NO2_MAX', 'PM10_MAX', 'HUMIDITY',
'TEMPRATURE_MAX']].median()
print("Median Values:\n", median_values)

# Standard Deviation of key variables


std_values = df[['NO_MAX', 'NO2_MAX', 'PM10_MAX', 'HUMIDITY',
'TEMPRATURE_MAX']].std()
print("Standard Deviation:\n", std_values)

Understand the Output


By the end of Exercises 3 to 8, you will have:

● Processed real-world air quality data.


● Performed cleaning, transformation, and visualization of key features.
● Explored the relationships between environmental conditions and pollutant levels
using descriptive statistics and correlation analysis.

Outputs
##############################################
Output of Exercise 3
##############################################
##############################################
Inspect the first few rows
##############################################
                        NAME  HUMIDITY     LIGHT  NO_MAX  NO_MIN  ...  UV_MIN  AIR_PRESSURE  LASTUPDATEDATETIME  Lattitude  Longitude
0            BopadiSquare_65    19.995  3762.914       0       0  ...     0.2         0.933      13/05/19 12:16  18.559427  73.828656
1      Karve Statue Square_5    20.730   529.245       0       0  ...     0.1         0.930      13/05/19 12:16  18.501727  73.813595
2       Lullanagar_Square_14    17.387   693.375       0       0  ...     0.2         0.926      13/05/19 12:16  18.487306  73.885650
3        Hadapsar_Gadital_01    18.725   723.631       0       0  ...     0.1         0.930      13/05/19 12:16  18.501834  73.941478
4  PMPML_Bus_Depot_Deccan_15    20.622   816.476       0       0  ...     NaN         0.932      13/05/19 12:16  18.451716  73.856170

[5 rows x 28 columns]
##############################################
Check the data types and non-null values
##############################################
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NAME 14999 non-null object
1 HUMIDITY 14805 non-null float64
2 LIGHT 14326 non-null float64
3 NO_MAX 14999 non-null int64
4 NO_MIN 14999 non-null int64
5 NO2_MAX 14973 non-null float64
6 NO2_MIN 14973 non-null float64
7 OZONE_MAX 14973 non-null float64
8 OZONE_MIN 14973 non-null float64
9 PM10_MAX 14646 non-null float64
10 PM10_MIN 14646 non-null float64
11 PM2_MAX 14646 non-null float64
12 PM2_MIN 14646 non-null float64
13 SO2_MAX 14973 non-null float64
14 SO2_MIN 14973 non-null float64
15 CO_MAX 14973 non-null float64
16 CO_MIN 14973 non-null float64
17 CO2_MAX 14963 non-null float64
18 CO2_MIN 14963 non-null float64
19 SOUND 14805 non-null float64
20 TEMPRATURE_MAX 14963 non-null float64
21 TEMPRATURE_MIN 14963 non-null float64
22 UV_MAX 14116 non-null float64
23 UV_MIN 14116 non-null float64
24 AIR_PRESSURE 14804 non-null float64
25 LASTUPDATEDATETIME 14999 non-null object
26 Lattitude 14999 non-null float64
27 Longitude 14999 non-null float64
dtypes: float64(24), int64(2), object(2)
memory usage: 3.2+ MB
None
##############################################
Output of Exercise 4
##############################################
##############################################
Check for missing values
##############################################
NAME 0
HUMIDITY 194
LIGHT 673
NO_MAX 0
NO_MIN 0
NO2_MAX 26
NO2_MIN 26
OZONE_MAX 26
OZONE_MIN 26
PM10_MAX 353
PM10_MIN 353
PM2_MAX 353
PM2_MIN 353
SO2_MAX 26
SO2_MIN 26
CO_MAX 26
CO_MIN 26
CO2_MAX 36
CO2_MIN 36
SOUND 194
TEMPRATURE_MAX 36
TEMPRATURE_MIN 36
UV_MAX 883
UV_MIN 883
AIR_PRESSURE 195
LASTUPDATEDATETIME 0
Lattitude 0
Longitude 0
dtype: int64
##############################################
Confirm data cleaning
##############################################
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NAME 14999 non-null object
1 HUMIDITY 14999 non-null float64
2 LIGHT 14999 non-null float64
3 NO_MAX 14999 non-null int64
4 NO_MIN 14999 non-null int64
5 NO2_MAX 14999 non-null float64
6 NO2_MIN 14999 non-null float64
7 OZONE_MAX 14999 non-null float64
8 OZONE_MIN 14999 non-null float64
9 PM10_MAX 14999 non-null float64
10 PM10_MIN 14999 non-null float64
11 PM2_MAX 14999 non-null float64
12 PM2_MIN 14999 non-null float64
13 SO2_MAX 14999 non-null float64
14 SO2_MIN 14999 non-null float64
15 CO_MAX 14999 non-null float64
16 CO_MIN 14999 non-null float64
17 CO2_MAX 14999 non-null float64
18 CO2_MIN 14999 non-null float64
19 SOUND 14999 non-null float64
20 TEMPRATURE_MAX 14999 non-null float64
21 TEMPRATURE_MIN 14999 non-null float64
22 UV_MAX 14999 non-null float64
23 UV_MIN 14999 non-null float64
24 AIR_PRESSURE 14999 non-null float64
25 LASTUPDATEDATETIME 14999 non-null datetime64[ns]
26 Lattitude 14999 non-null float64
27 Longitude 14999 non-null float64
dtypes: datetime64[ns](1), float64(24), int64(2), object(1)
memory usage: 3.2+ MB
None
##############################################
Output of Exercise 5
##############################################
##############################################
Check the transformation
##############################################
NAME HUMIDITY LIGHT NO_MAX NO_MIN ...
UV_MIN AIR_PRESSURE LASTUPDATEDATETIME Lattitude Longitude
0 BopadiSquare_65 -1.085440 0.185820 0.0 0.0 ...
0.984757 0.072996 2019-05-13 12:16:00 18.559427 73.828656
1 Karve Statue Square_5 -1.039402 -0.312201 0.0 0.0 ...
0.055806 -0.302701 2019-05-13 12:16:00 18.501727 73.813595
2 Lullanagar_Square_14 -1.248798 -0.286923 0.0 0.0 ...
0.984757 -0.803631 2019-05-13 12:16:00 18.487306 73.885650
3 Hadapsar_Gadital_01 -1.164989 -0.282264 0.0 0.0 ...
0.055806 -0.302701 2019-05-13 12:16:00 18.501834 73.941478
4 PMPML_Bus_Depot_Deccan_15 -1.046166 -0.267964 0.0 0.0 ...
0.000000 -0.052237 2019-05-13 12:16:00 18.451716 73.856170

[5 rows x 28 columns]
##############################################
Output of Exercise 6
##############################################
##############################################
Descriptive statistics
##############################################
HUMIDITY LIGHT NO_MAX NO_MIN ... AIR_PRESSURE
LASTUPDATEDATETIME Lattitude Longitude
count 1.499900e+04 1.499900e+04 14999.0 14999.0 ... 1.499900e+04
14999 14999.000000 14999.000000
mean 1.136944e-16 5.684721e-18 0.0 0.0 ... -6.897461e-15
2019-04-18 03:16:54.379625472 18.504770 73.849372
min -1.701101e+00 -3.936832e-01 0.0 0.0 ... -1.194931e+01
2019-04-08 00:01:00 18.451716 73.792927
25% -9.244623e-01 -3.932329e-01 0.0 0.0 ... -1.774689e-01
2019-04-12 20:21:30 18.487306 73.824393
50% 4.450639e-16 -3.364439e-01 0.0 0.0 ... 7.299581e-02
2019-04-17 22:20:00 18.501834 73.828656
75% 7.363311e-01 0.000000e+00 0.0 0.0 ... 3.234605e-01
2019-04-23 03:58:00 18.525066 73.858092
max 2.709809e+00 8.622138e+00 0.0 0.0 ... 9.496222e-01
2019-05-13 12:46:00 18.559427 73.941478
std 1.000033e+00 1.000033e+00 0.0 0.0 ... 1.000033e+00
NaN 0.028060 0.042748

[8 rows x 27 columns]
##############################################
Correlation between numerical columns
##############################################
HUMIDITY LIGHT NO_MAX NO_MIN NO2_MAX ... UV_MAX
UV_MIN AIR_PRESSURE Lattitude Longitude
HUMIDITY 1.000000 -0.197010 NaN NaN -0.044097 ... -0.455358
-0.448803 0.022563 -0.023711 -0.036286
LIGHT -0.197010 1.000000 NaN NaN -0.178748 ... 0.183274
0.201387 0.028497 0.101492 -0.048619
NO_MAX NaN NaN NaN NaN NaN ... NaN
NaN NaN NaN NaN
NO_MIN NaN NaN NaN NaN NaN ... NaN
NaN NaN NaN NaN
NO2_MAX -0.044097 -0.178748 NaN NaN 1.000000 ... -0.293246
-0.049395 0.018118 -0.209901 0.472119
NO2_MIN -0.066931 -0.169840 NaN NaN 0.845805 ... -0.235142
-0.010271 0.031517 -0.229492 0.470624
OZONE_MAX -0.014712 0.088718 NaN NaN -0.288305 ... 0.085393
-0.053782 0.056310 -0.032154 -0.080527
OZONE_MIN 0.016655 0.023345 NaN NaN -0.069185 ... -0.021095
0.100634 0.018579 0.159597 -0.036567
PM10_MAX -0.288862 0.038256 NaN NaN 0.254617 ... -0.072332
0.033999 -0.003582 0.115276 0.027580
PM10_MIN -0.256055 0.001034 NaN NaN 0.229997 ... -0.076591
0.077383 -0.036099 0.009867 0.178765
PM2_MAX -0.271402 0.042085 NaN NaN 0.255919 ... -0.082036
0.038399 -0.018413 0.158809 -0.021888
PM2_MIN -0.249854 -0.001211 NaN NaN 0.225540 ... -0.072932
0.078671 -0.054172 0.018211 0.169836
SO2_MAX 0.042583 -0.016404 NaN NaN 0.063862 ... 0.027795
-0.071879 0.164260 -0.064666 0.152573
SO2_MIN 0.010488 -0.019259 NaN NaN 0.032783 ... 0.048649
-0.054404 0.119508 -0.193233 0.316796
CO_MAX 0.033494 0.032465 NaN NaN 0.412400 ... -0.188511
0.004618 0.094195 -0.039327 0.228064
CO_MIN 0.042300 -0.022054 NaN NaN 0.549969 ... -0.236444
0.012838 0.125869 -0.028176 0.390238
CO2_MAX -0.033887 0.000595 NaN NaN -0.026485 ... 0.025683
0.010616 0.009534 0.023452 -0.000432
CO2_MIN -0.081031 0.028052 NaN NaN 0.123964 ... -0.031739
0.017659 0.132243 0.190245 -0.049888
SOUND -0.101544 0.118385 NaN NaN 0.135084 ... 0.301812
0.034389 0.101464 -0.081280 0.120192
TEMPRATURE_MAX -0.014381 0.047581 NaN NaN 0.028936 ... -0.016208
-0.006466 -0.054032 0.208727 -0.254933
TEMPRATURE_MIN -0.072410 -0.060941 NaN NaN 0.321034 ... -0.193122
0.015924 -0.119544 0.034650 0.095099
UV_MAX -0.455358 0.183274 NaN NaN -0.293246 ... 1.000000
0.354292 0.049916 -0.006787 -0.253838
UV_MIN -0.448803 0.201387 NaN NaN -0.049395 ... 0.354292
1.000000 0.027530 0.060500 0.033563
AIR_PRESSURE 0.022563 0.028497 NaN NaN 0.018118 ... 0.049916
0.027530 1.000000 0.162451 -0.170239
Lattitude -0.023711 0.101492 NaN NaN -0.209901 ... -0.006787
0.060500 0.162451 1.000000 -0.400559
Longitude -0.036286 -0.048619 NaN NaN 0.472119 ... -0.253838
0.033563 -0.170239 -0.400559 1.000000

[26 rows x 26 columns]


##############################################
Output of Exercise 7
##############################################
(Exercise 7 produces plots rather than text output: the NO2_MAX histogram, the HUMIDITY vs NO_MAX scatter plot, and the correlation heatmap.)
##############################################
Output of Exercise 8
##############################################
##############################################
Mean of key variables
##############################################
Mean Values:
NO_MAX 0.000000e+00
NO2_MAX 9.095553e-17
PM10_MAX 3.031851e-17
HUMIDITY 1.136944e-16
TEMPRATURE_MAX 1.311276e-15
dtype: float64
##############################################
Median of key variables
##############################################
Median Values:
NO_MAX 0.000000e+00
NO2_MAX 1.199320e-01
PM10_MAX 0.000000e+00
HUMIDITY 4.450639e-16
TEMPRATURE_MAX 1.249673e-01
dtype: float64
##############################################
Standard Deviation of key variables
##############################################
Standard Deviation:
NO_MAX 0.000000
NO2_MAX 1.000033
PM10_MAX 1.000033
HUMIDITY 1.000033
TEMPRATURE_MAX 1.000033
dtype: float64
Day 3
Practical Exercise 9: Linear Regression
Objective:
Goal: Use Python’s scikit-learn library to implement a linear
regression model.
Steps for Linear Regression:
1. Dataset: We will use the California Housing dataset to predict house prices
based on various features.
2. Load the California Housing dataset using fetch_california_housing().
3. Split the data into training and testing sets.
4. Train a linear regression model using the training data.
5. Predict house prices on the testing data.
6. Evaluate the model using the Mean Squared Error (MSE) metric.

# Import necessary libraries


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_california_housing
import numpy as np

# Load the dataset


housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Train a Linear Regression model


regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict on the test data


y_pred = regressor.predict(X_test)

# Evaluate the model using Mean Squared Error (MSE)


mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Print coefficients
print("Coefficients:", regressor.coef_)

Output
##############################################
Evaluate the model using Mean Squared Error (MSE)
##############################################
Mean Squared Error: 0.5558915986952425
##############################################
Print coefficients
##############################################
Coefficients: [ 4.48674910e-01 9.72425752e-03 -1.23323343e-01
7.83144907e-01
-2.02962058e-06 -3.52631849e-03 -4.19792487e-01 -4.33708065e-01]
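
MSE is expressed in squared units of the target, so it can be hard to interpret on its own. As an optional follow-up, the R² score reports the fraction of variance explained by the model:

from sklearn.metrics import r2_score

# R-squared: 1.0 is a perfect fit, 0.0 is no better than predicting the mean
r2 = r2_score(y_test, y_pred)
print("R^2 Score:", r2)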

Practical Exercise 10: Logistic Regression


Objective:
Goal: Predict a binary outcome (e.g., whether a student will pass or fail) based on
input features (e.g., number of study hours).
Steps for Logistic Regression:
1. Mathematical Representation: Logistic regression uses a sigmoid function to
predict the probability of class membership.
2. Instructions: Implement Logistic Regression using the Iris dataset to classify
flowers into species based on their sepal and petal measurements.

# Import necessary libraries


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load the dataset


iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)

# Train a Logistic Regression model


classifier = LogisticRegression(max_iter=200)
classifier.fit(X_train, y_train)
# Predict on the test data
y_pred = classifier.predict(X_test)

# Evaluate the model using accuracy


accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Output
Accuracy: 1.0
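
Step 1 above describes how logistic regression predicts class probabilities. To inspect those probabilities directly, predict_proba can be called on the trained classifier (an optional sketch, run after the code above):

# Estimated probability of each species for the first five test samples
probabilities = classifier.predict_proba(X_test[:5])
print(probabilities)
print("Predicted classes:", classifier.predict(X_test[:5]))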

Practical Exercise 11: Decision Trees


Objective:
Goal: Use a tree-like structure to split the dataset based on features and make
predictions. Each internal node represents a decision based on a feature, and
each leaf node represents the output label.
Steps for Decision Trees:
1. Instructions: Train a decision tree to classify the species of flowers from the Iris
dataset.

# Import necessary libraries


from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load the dataset


iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)

# Train a Decision Tree classifier


dt_classifier = DecisionTreeClassifier()
dt_classifier.fit(X_train, y_train)

# Predict on the test data


y_pred = dt_classifier.predict(X_test)

# Evaluate the model using accuracy


accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Output
Accuracy: 1.0
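
To see the tree-like structure described in the objective, scikit-learn can print the learned decision rules as text. An optional sketch, run after training dt_classifier:

from sklearn.tree import export_text

# Print the decision rules learned by the tree
rules = export_text(dt_classifier, feature_names=list(iris.feature_names))
print(rules)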

Practical Exercise 12: Random Forests


Objective:
Goal: Use multiple decision trees (ensemble method) to improve the accuracy of
predictions. The model combines the predictions from each tree for a final
result.
Steps for Random Forests:
1. Instructions: Use a random forest classifier to predict the species of flowers from
the Iris dataset.

# Import necessary libraries


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load the dataset


iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Train a Random Forest classifier


rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_train, y_train)

# Predict on the test data


y_pred = rf_classifier.predict(X_test)

# Evaluate the model using accuracy


accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Output
Accuracy: 1.0
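
Because a random forest averages many trees, it also provides feature importances as a by-product. An optional sketch for inspecting which features drive the predictions:

import pandas as pd

# Average importance of each feature across all trees in the forest
importances = pd.Series(rf_classifier.feature_importances_, index=iris.feature_names)
print(importances.sort_values(ascending=False))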

Understand the Output


1. Supervised Learning includes regression (predicting continuous values) and
classification (predicting categorical outcomes).
2. Linear Regression is used for continuous predictions, while Logistic Regression,
Decision Trees, and Random Forests are popular for classification tasks.
3. Practical Exercises: Hands-on implementation of these models using Python’s
scikit-learn library provides a foundation for understanding supervised learning
techniques.
Day 4
Practical Exercise 13: K-Means Clustering
Objective:
Goal: K-Means Clustering is a method to partition a dataset into K distinct
clusters. It attempts to minimize the distance between data points and the
centroid (center of a cluster).
Steps for K-Means Clustering:
1. Choose the number of clusters K.
2. Initialize K cluster centroids randomly.
3. Assign each data point to the nearest centroid.
4. Recalculate the centroid of each cluster.
5. Repeat until convergence (no change in centroids).

# Import necessary libraries


from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt

# Load the Iris dataset


iris = load_iris()
X = iris.data

# Apply KMeans clustering


kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# Visualize the clusters (using only 2 features for simplicity)


plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
s=300, c='red', label='Centroids')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title("K-Means Clustering")
plt.legend()
plt.show()

Output
A scatter plot of the three K-Means clusters (colored by cluster label), with the cluster centroids marked in red.
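
Step 1 of the procedure asks you to choose the number of clusters K. A common (optional) way to pick K is the elbow method: plot the within-cluster sum of squares (inertia) for several values of K and look for the point where the curve flattens. A minimal sketch, reusing X from the code above:

# Elbow method: run K-Means for several values of K and plot the inertia
inertias = []
k_values = range(1, 10)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Choosing K')
plt.show()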
Practical Exercise 14: Hierarchical Clustering
Objective:
Goal: Hierarchical Clustering builds a tree-like structure of nested clusters. There
are two types:

● Agglomerative: Start with each data point as its own cluster, then iteratively
merge the closest clusters.
● Divisive: Start with one cluster and recursively split it into smaller clusters.

Steps for Hierarchical Clustering


1. Dendrogram: A tree diagram used to visualize the merging of clusters.

# Import necessary libraries


from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
# Load the Iris dataset
iris = load_iris()
X = iris.data

# Perform hierarchical/agglomerative clustering


Z = linkage(X, method='ward')

# Plot the dendrogram


plt.figure(figsize=(10, 7))
dendrogram(Z, truncate_mode='level', p=5)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Cluster Size")
plt.ylabel("Distance")
plt.show()

Output
A dendrogram showing how samples are progressively merged into larger clusters.
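
The dendrogram only visualizes the merging order. If you also want flat cluster labels, scikit-learn's AgglomerativeClustering can cut the hierarchy at a chosen number of clusters (an optional sketch, reusing X from the code above):

from sklearn.cluster import AgglomerativeClustering

# Cut the hierarchy into 3 clusters using the same Ward linkage
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)
print(labels[:20])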
Practical Exercise 15: Dimensionality Reduction
Objective:
Goal: In many real-world datasets, there are a large number of features, making
analysis complex. Dimensionality Reduction techniques reduce the number of
features while retaining important information.

Principal Component Analysis (PCA) is a popular technique for dimensionality
reduction. PCA transforms the data into a set of linearly uncorrelated components
(principal components) that capture the maximum variance in the data. The goal is
to reduce the number of features while retaining the most important variance.

Steps for Principal Component Analysis (PCA)


1. Standardize the data.
2. Compute the covariance matrix of the features.
3. Calculate eigenvectors and eigenvalues.
4. Select the top eigenvectors to form the new feature space.

# Import necessary libraries


from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Apply PCA to reduce to 2 dimensions


pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Plot the reduced dimensions


plt.figure(figsize=(8,6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.title("PCA - Iris Dataset")
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
Output
A 2-D scatter plot of the first two principal components, colored by species.
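
Step 1 of the PCA procedure calls for standardizing the data, and it is also useful to check how much variance each component captures. An optional sketch, reusing X from the code above:

from sklearn.preprocessing import StandardScaler

# Standardize the features, apply PCA, and report the variance captured
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=2)
X_pca_scaled = pca_scaled.fit_transform(X_scaled)
print("Explained variance ratio:", pca_scaled.explained_variance_ratio_)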

Understand the Output


1. Clustering Techniques such as K-Means and Hierarchical Clustering help group
data points based on similarities without any predefined labels.
2. Dimensionality Reduction using techniques like PCA helps in reducing the number
of features in a dataset while retaining important information.
3. Practical Exercises: Implementing these algorithms in Python using libraries like
scikit-learn helps in visualizing and understanding the techniques.
Day 5
Practical Exercise 16: Grid Search
Objective:
Goal: We will tune hyperparameters for a Random Forest Classifier using Grid
Search on the Iris dataset.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

# Define the hyperparameter grid


param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [4, 6, 8, 10],
'criterion': ['gini', 'entropy']
}

# Initialize the RandomForestClassifier


rf = RandomForestClassifier(random_state=42)

# Perform Grid Search


grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best hyperparameters
print("Best Parameters: ", grid_search.best_params_)
print("Best Cross-Validation Score:
{:.2f}".format(grid_search.best_score_))

# Test the best model on the test set


best_rf = grid_search.best_estimator_
print("Test Accuracy: {:.2f}".format(best_rf.score(X_test, y_test)))
Output
##############################################
Best hyperparameters
##############################################
Best Parameters: {'criterion': 'gini', 'max_depth': 4, 'n_estimators':
100}
Best Cross-Validation Score: 0.94
##############################################
Test the best model on the test set
##############################################
Test Accuracy: 1.00

Practical Exercise 17: Random Search


Objective:
Goal: We will tune hyperparameters for a Random Forest Classifier using Random
Search on the Iris dataset.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

# Define the hyperparameter space


param_dist = {
'n_estimators': [50, 100, 200, 300],
'max_depth': [None, 4, 6, 8, 10],
'criterion': ['gini', 'entropy'],
'bootstrap': [True, False]
}

# Initialize the RandomForestClassifier


rf = RandomForestClassifier(random_state=42)
# Perform Randomized Search
random_search = RandomizedSearchCV(estimator=rf,
param_distributions=param_dist, n_iter=10, cv=5, random_state=42)
random_search.fit(X_train, y_train)

# Best hyperparameters
print("Best Parameters: ", random_search.best_params_)
print("Best Cross-Validation Score:
{:.2f}".format(random_search.best_score_))

# Test the best model on the test set


best_rf_random = random_search.best_estimator_
print("Test Accuracy: {:.2f}".format(best_rf_random.score(X_test, y_test)))

Output
##############################################
Best hyperparameters
##############################################
Best Parameters: {'n_estimators': 200, 'max_depth': 6, 'criterion':
'entropy', 'bootstrap': True}
Best Cross-Validation Score: 0.94
##############################################
Test the best model on the test set
##############################################
Test Accuracy: 1.00

Practical Exercise 18: Neural Network for Classification (Keras)


Objective:
Goal: Keras is a high-level neural networks API, written in Python, that runs on
top of TensorFlow. It provides an easy way to build and train neural networks.

Key Concepts:

● Input Layer: Receives the input data.


● Hidden Layers: Intermediate layers where computations are performed.
● Output Layer: Produces the final output based on the data processed by the
hidden layers.
● Activation Functions: Introduce non-linearity into the model (e.g., ReLU, sigmoid).


Output
Test Accuracy: 1.00
(The exact value can vary from run to run, but a small dense network typically classifies the Iris test set near-perfectly.)
