ML Aml Cse It Lab Manual Final

Uploaded by

Dhyey Baldha
Faculty of Degree Engineering – 083

Department of Computer Science & Engineering - 31

SEMESTER: 7

LAB MANUAL
Machine Learning 3170724

Name:

Enrollment No:

Batch:

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING - 31
DR. SUBHASH TECHNICAL CAMPUS

CERTIFICATE

Roll No.:- Enrollment No.:-

This is to certify that the practical work satisfactorily carried out and hence recorded in this journal is the
bona fide work of Mr./Miss _________________________________, student of Computer Science & Engineering - 31,
7th Semester, in the Machine Learning (3170724) Laboratory of Dr. Subhash Technical Campus during the academic
year 2024-25.

Submission Date: ……………

Subject in Charge HOD

Examiner

INDEX

SR. NO.   AIM                                                                    PAGE NO   DATE   SIGN

1         Introduction to various libraries, tools used in Machine Learning.
2         Importing dataset and reading a dataset using Pandas Library.
3         To clean the data and apply methods to deal with Missing Values.
4         To deal with Outliers for data pre-processing.
5         To implement Linear Regression in Python.
6         To Evaluate Linear Regression in Python.
7         To Implement Logistic Regression in Python.
8         Implement KNN Algorithm for Classification.

Practical - 1
Introduction to various libraries, tools used in Machine Learning.

Machine Learning, as the name suggests, is the science of programming computers so that they are able to learn
from different kinds of data. A more general definition given by Arthur Samuel is: “Machine Learning is the field of
study that gives computers the ability to learn without being explicitly programmed.” It is typically used to solve
various types of real-life problems.
In earlier days, people performed Machine Learning tasks by manually coding all the algorithms and the
mathematical and statistical formulas. This made the process time-consuming, tedious, and inefficient. Today it
has become much easier and more efficient thanks to various Python libraries, frameworks, and modules. Python is
one of the most popular programming languages for this task, and one of the reasons is its vast collection of
libraries. Python libraries that are used in Machine Learning include:

1. NumPy
2. SciPy
3. Scikit-learn
4. TensorFlow
5. Keras
6. PyTorch
7. Pandas
8. Matplotlib

NumPy

NumPy is a very popular Python library for processing large multi-dimensional arrays and matrices, and it provides a
large collection of high-level mathematical functions to operate on them. It is very useful for fundamental scientific
computations in Machine Learning, particularly for linear algebra, Fourier transforms, and random number generation.
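
As a minimal sketch of the linear algebra and random number capabilities mentioned above, the snippet below solves a small system of equations and draws a few random samples. The matrix values and the seed are arbitrary choices made for illustration.

```python
import numpy as np

# A small 2x2 linear system A @ x = b (values chosen only for illustration).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Solve the system with NumPy's linear algebra module.
x = np.linalg.solve(A, b)

# Reproducible random numbers from a seeded generator (seed 0 is arbitrary).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=(3, 2))

print("solution:", x)
print("random sample shape:", samples.shape)
```

Multiplying A by the solution x should reproduce b, which is a quick way to check the result.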

SciPy

SciPy is a very popular library among Machine Learning enthusiasts as it contains different modules for optimization,
linear algebra, integration, and statistics. There is a difference between the SciPy library and the SciPy stack: the
SciPy library is one of the core packages that make up the SciPy stack. SciPy is also very useful for image manipulation.

Scikit-learn

Scikit-learn is one of the most popular ML libraries for classical ML algorithms. It is built on top of two basic Python
libraries, namely NumPy and SciPy. Scikit-learn supports most of the supervised and unsupervised learning algorithms.
It can also be used for data mining and data analysis, which makes it a great tool for anyone who is starting out with
ML.

TensorFlow

TensorFlow is a very popular open-source library for high-performance numerical computation developed by the
Google Brain team at Google. As the name suggests, TensorFlow is a framework for defining and running
computations involving tensors. It can train and run deep neural networks that can be used to develop several AI
applications. TensorFlow is widely used in the field of deep learning research and application.

Keras

Keras is a very popular Machine Learning library for Python. It is a high-level neural networks API that runs on top of
a backend; earlier versions could run on top of TensorFlow, CNTK, or Theano. It can run seamlessly on both CPU and
GPU. Keras makes it really easy for ML beginners to build and design a neural network. One of the best things about
Keras is that it allows for easy and fast prototyping.

PyTorch

PyTorch is a popular open-source Machine Learning library for Python based on Torch, which is an open-source
Machine Learning library that is implemented in C with a wrapper in Lua. It has an extensive choice of tools and
libraries that support Computer Vision, Natural Language Processing (NLP), and many more ML applications. It allows
developers to perform computations on tensors with GPU acceleration and also helps in creating computational
graphs.

Pandas

Pandas is a popular Python library for data analysis. It is not directly related to Machine Learning, but a dataset must
be prepared before training, and Pandas comes in handy here because it was developed specifically for data extraction
and preparation. It provides high-level data structures and a wide variety of tools for data analysis, including many
built-in methods for grouping, combining, and filtering data.
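
The three operations named above (grouping, combining, and filtering) can be sketched on a tiny table as follows. The DataFrames, column names, and values are hypothetical and exist only to illustrate the calls.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
sales = pd.DataFrame({
    "city": ["Rajkot", "Surat", "Rajkot", "Surat"],
    "units": [10, 4, 6, 8],
})
prices = pd.DataFrame({"city": ["Rajkot", "Surat"], "price": [5.0, 7.0]})

# Grouping: total units sold per city.
totals = sales.groupby("city")["units"].sum()

# Combining: attach each city's price to every sales row.
merged = sales.merge(prices, on="city", how="left")

# Filtering: keep only the rows with more than 5 units.
large = merged[merged["units"] > 5]

print(totals)
print(large)
```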

Matplotlib

Matplotlib is a very popular Python library for data visualization. Like Pandas, it is not directly related to Machine
Learning. It comes in particularly handy when a programmer wants to visualize the patterns in the data. It is a 2D
plotting library used for creating graphs and plots. A module named pyplot makes plotting easy for programmers,
as it provides features to control line styles, font properties, axis formatting, etc. It provides various kinds of
graphs and plots for data visualization, such as histograms, error charts, and bar charts.

Signature:

Date: _________________________

Practical-2
Importing dataset and reading a dataset using Pandas Library.

import pandas as pd

# Load the dataset from a CSV file
disease_df = pd.read_csv("/framingham.csv")

# Drop the 'education' column and rename 'male' to 'Sex_male'
disease_df = disease_df.drop(columns=['education'])
disease_df = disease_df.rename(columns={'male': 'Sex_male'})

# Show the first five rows
print(disease_df.head())

Output:

Signature:

Date: _________________________

Practical-3
To clean the data and apply methods to deal with Missing Values.

When you have a dataset, the first step is to check which columns have missing data and how many values are missing.
Let us use one of the most famous datasets among data science learners, the Titanic survivors dataset. Read the
dataset using the pandas read_csv function as shown below.

import pandas as pd

# Load the Titanic dataset and show the first five rows
df = pd.read_csv("/content/titanic_data.csv")
print(df.head())

How to check which columns have missing data, and how many?
The isnull() function is used for this. When you call sum() on the result of isnull(), the output is the number of
missing values in each column.

missing_values=df.isnull().sum()
print(missing_values)

Although we know how many values are missing in each column, it is also useful to know what percentage of the
total rows they represent. Let us calculate that in a single line of code.

# Percentage of missing values per column
mis_value_percent = 100 * df.isnull().sum() / len(df)
print(mis_value_percent)

Dropping rows with Missing Values

This is a simple method in which we drop all the rows that have a missing value in a particular column. As easy as
this is, it comes with a big disadvantage: you might end up losing a large chunk of your data. This reduces the size
of your dataset and can make your model predictions biased. You should use this only when the number of missing
values is very small.

For example, the ‘Embarked’ column has just 2 missing values, so we can drop the rows where this column is missing.
Follow the code snippet below.

print('Dataset before :', len(df))

# Drop the rows where 'Embarked' is missing
df.dropna(subset=['Embarked'], how='any', inplace=True)
print('Dataset after :', len(df))
print('missing values :', df['Embarked'].isnull().sum())

Imputation with mean

When a continuous variable column has missing values, you can calculate the mean of the non-null values and use it
to fill the vacancies.

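# Fill missing ages with the mean of the non-null ages
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df['Age'][:10])

Imputation with median

# Alternative to the mean: use one or the other, because after the
# first fill no missing values remain in the column
df['Age'] = df['Age'].fillna(df['Age'].median())
print(df['Age'][:10])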
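
To see why the choice between mean and median matters, here is a small sketch on hypothetical ages (not the Titanic data) containing one missing value and one extreme value. The mean is pulled towards the extreme value, while the median is not.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value and one extreme value (90).
ages = pd.Series([20.0, 22.0, 24.0, np.nan, 90.0])

# mean() and median() skip missing values by default.
mean_filled = ages.fillna(ages.mean())      # mean of 20, 22, 24, 90 is 39.0
median_filled = ages.fillna(ages.median())  # median of 20, 22, 24, 90 is 23.0

print(mean_filled.tolist())
print(median_filled.tolist())
```

When a column contains extreme values, the median is usually the safer fill value.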

Signature:

Date: _________________________
Practical-4
To deal with Outliers for data pre-processing.

Outlier Detection And Removal


Here a pandas DataFrame is used, since real-world projects usually need to detect outliers that surface during the
data analysis step. The same approach can be used on lists and Series objects.

Dataset Used For Outlier Detection

The dataset used in this practical is the Diabetes dataset, which ships with the scikit-learn library.

# Importing
from sklearn.datasets import load_diabetes
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
diabetics = load_diabetes()

# Create the dataframe with the feature names as column names
df_diabetics = pd.DataFrame(diabetics.data, columns=diabetics.feature_names)
print(df_diabetics.head())

Visualizing and Removing Outliers Using Box Plot

A box plot captures the summary of the data effectively with only a simple box and whiskers. It summarizes sample
data using the 25th, 50th, and 75th percentiles. One can get insights (quartiles, median, and outliers) into the
dataset just by looking at its box plot.

# Box Plot
import seaborn as sns
sns.boxplot(x=df_diabetics['bmi'])

import seaborn as sns
import matplotlib.pyplot as plt

def removal_box_plot(df, column, threshold):
    # Box plot of the original column
    sns.boxplot(x=df[column])
    plt.title(f'Original Box Plot of {column}')
    plt.show()

    # Keep only the rows at or below the threshold
    removed_outliers = df[df[column] <= threshold]

    # Box plot after removing the values above the threshold
    sns.boxplot(x=removed_outliers[column])
    plt.title(f'Box Plot without Outliers of {column}')
    plt.show()
    return removed_outliers

# Values of 'bmi' above this threshold are treated as outliers
threshold_value = 0.12

no_outliers = removal_box_plot(df_diabetics, 'bmi', threshold_value)
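
The threshold above is picked by eye from the plot. As an alternative sketch, the limits can be computed from the quartiles using the 1.5 × IQR rule, which is the convention box plot whiskers follow by default. The values below are hypothetical and are not taken from the Diabetes dataset.

```python
import numpy as np

# Hypothetical measurements; 40.0 lies far from the rest.
values = np.array([10.0, 12.0, 12.0, 13.0, 14.0, 15.0, 16.0, 40.0])

# First and third quartiles and the interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

inliers = values[(values >= lower) & (values <= upper)]
outliers = values[(values < lower) | (values > upper)]

print("limits:", lower, upper)
print("outliers:", outliers)
```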

Z-score

Z-score is also called a standard score. It tells how far a data point is from the mean, measured in standard deviations.
After setting a threshold value, one can use the z-scores of the data points to define the outliers.

Z-score = (data point - mean) / standard deviation

In this example, we calculate the z-scores for the ‘age’ column in the DataFrame df_diabetics using
the zscore function from the SciPy stats module. The resulting array z contains the absolute z-scores for each data point
in the ‘age’ column, indicating how many standard deviations each value is from the mean.

from scipy import stats
import numpy as np

# Absolute z-score of every value in the 'age' column
z = np.abs(stats.zscore(df_diabetics['age']))
print(z)

import numpy as np

# Rows whose absolute z-score exceeds this threshold are treated as outliers
threshold_z = 2

# Positions of the outlier rows (the DataFrame uses the default 0..n-1 index)
outlier_indices = np.where(z > threshold_z)[0]
no_outliers = df_diabetics.drop(outlier_indices)
print("Original DataFrame Shape:", df_diabetics.shape)
print("DataFrame Shape after Removing Outliers:", no_outliers.shape)
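
To make the z-score formula concrete, the sketch below applies it by hand with NumPy to a few hypothetical values. It uses the population standard deviation (ddof=0), which is also the default of scipy.stats.zscore.

```python
import numpy as np

# Hypothetical values; 30.0 lies far from the rest.
data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 12.0, 30.0])

mean = data.mean()
std = data.std()          # population standard deviation (ddof=0)

# Z-score = (data point - mean) / standard deviation
z = (data - mean) / std

# Flag the points whose absolute z-score exceeds the threshold.
threshold_z = 2
outliers = data[np.abs(z) > threshold_z]

print("z-scores:", np.round(z, 2))
print("outliers:", outliers)
```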

Practical-5
To implement Linear Regression in Python.

Machine Learning is a branch of Artificial Intelligence that focuses on the development of algorithms and statistical
models that can learn from data and make predictions on it. Linear regression is a machine-learning algorithm, more
specifically a supervised machine-learning algorithm, that learns from labelled datasets and maps the data points to the
best-fitting linear function, which can then be used for prediction on new data.

First, we should know what a supervised machine learning algorithm is. It is a type of machine learning where the
algorithm learns from labelled data. Labelled data means a dataset whose target values are already known.
Supervised learning has two types:

Classification: It predicts the class of a data point based on the independent input variables. A class is a categorical
or discrete value, for example whether the image of an animal shows a cat or a dog.

Regression: It predicts a continuous output variable based on the independent input variables, for example the
prediction of house prices based on parameters such as house age, distance from the main road, location, area, etc.

Types of Linear Regression

There are two main types of linear regression:

Simple Linear Regression

This is the simplest form of linear regression, and it involves only one independent variable and one dependent
variable. The equation for simple linear regression is:
Y = β0 + β1X
where:
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope

Multiple Linear Regression


This involves more than one independent variable and one dependent variable. The equation for multiple linear
regression is:
Y = β0 + β1X1 + β2X2 + … + βnXn
where:
Y is the dependent variable
X1, X2, …, Xn are the independent variables
β0 is the intercept
β1, β2, …, βn are the slopes
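
As a minimal sketch of the multiple linear regression equation, the snippet below recovers β0, β1, and β2 with NumPy's least-squares solver. The data is hypothetical and generated without noise from Y = 1 + 2·X1 + 3·X2, so the fitted coefficients should match those values.

```python
import numpy as np

# Hypothetical noise-free data generated from Y = 1 + 2*X1 + 3*X2.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient is the intercept.
design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: beta[0] is the intercept, beta[1:] are the slopes.
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

print("intercept:", beta[0])
print("slopes:", beta[1:])
```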
The goal of the algorithm is to find the best-fit line equation that can predict the values based on the independent
variables.
In regression, a set of records with X and Y values is available, and these values are used to learn a function. If you
want to predict Y for an unseen X, this learned function can be used. Since Y is continuous in regression, we need a
function that predicts a continuous Y given X as the independent features.
What is the best-fit line?
Our primary objective while using linear regression is to locate the best-fit line, which means that the error between
the predicted and actual values should be kept to a minimum. The best-fit line is the line with the least error.
The best-fit line equation provides a straight line that represents the relationship between the dependent and
independent variables. The slope of the line indicates how much the dependent variable changes for a unit change in the
independent variable(s).

Here Y is called the dependent or target variable, and X is called the independent variable, also known as the predictor
of Y. There are many types of functions or models that can be used for regression. A linear function is the simplest type
of function. Here, X may be a single feature or multiple features representing the problem.
Linear regression predicts the value of a dependent variable (y) based on a given independent variable (x) using a linear
function; hence the name Linear Regression. In the salary example used in this practical, X (input) is the work
experience and Y (output) is the salary of a person. The regression line is the best-fit line for our model.
Different values of the weights (the coefficients of the line) result in different regression lines, so we use a cost
function to find the values that give the best-fit line.
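
A minimal sketch of this idea for a single feature is shown below. It assumes mean squared error as the cost function (the text does not name a specific one) and uses the standard closed-form least-squares formulas for the slope and intercept. The data points are hypothetical.

```python
import numpy as np

# Hypothetical data: years of experience (x) and salary in arbitrary units (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

def mse_cost(b0, b1):
    """Mean squared error of the line y = b0 + b1 * x on the data above."""
    return np.mean((y - (b0 + b1 * x)) ** 2)

# Closed-form least-squares estimates of the slope and intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print("intercept:", b0, "slope:", b1)
print("cost of best-fit line:", mse_cost(b0, b1))
print("cost of another line:", mse_cost(0.0, 2.0))
```

Any other choice of intercept and slope gives a cost that is at least as large as that of the best-fit line.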

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load the dataset (assumes two columns: years of experience, then salary)
data_set = pd.read_csv('/content/salary_data.csv')

x = data_set.iloc[:, :-1].values   # features: every column except the last
y = data_set.iloc[:, 1].values     # target: the second column

# Splitting the dataset into training and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)

# Fitting the Simple Linear Regression model to the training dataset
regressor = LinearRegression()
regressor.fit(x_train, y_train)

# Prediction of Test and Training set result
y_pred = regressor.predict(x_test)
x_pred = regressor.predict(x_train)

# Visualizing the Training set results
plt.scatter(x_train, y_train, color="green")
plt.plot(x_train, x_pred, color="red")
plt.title("Salary vs Experience (Training Dataset)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary (In Rupees)")
plt.show()

# Visualizing the Test set results (the red line is the same fitted line)
plt.scatter(x_test, y_test, color="blue")
plt.plot(x_train, x_pred, color="red")
plt.title("Salary vs Experience (Test Dataset)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary (In Rupees)")
plt.show()

Signature:

Date: _________________________

Practical-6
To Evaluate Linear Regression in Python.
Evaluating linear regression models

There are various metrics that we can use to evaluate linear regression models. Since models can't be 100 percent
accurate, evaluating the model on different metrics can help us optimize the performance, fine-tune it, and obtain
better results. The metrics we can use include:
Mean Absolute Error (MAE) measures the absolute difference between the actual and predicted values. We sum the
absolute values of all the prediction errors and divide by the total number of data points.

from sklearn.metrics import mean_absolute_error

print('MAE:', mean_absolute_error(y_test, y_pred))

Mean Squared Error (MSE):

This is a widely used metric. It measures the squared difference between actual and predicted values. We sum the
squares of all prediction errors and divide by the number of data points.

To get the MSE from the model, import the mean_squared_error function from the sklearn.metrics module.

from sklearn.metrics import mean_squared_error

print("MSE:", mean_squared_error(y_test, y_pred))

Root Mean Squared Error(RMSE) is the square root of MSE.

import numpy as np
print("RMSE",np.sqrt(mean_squared_error(y_test,y_pred)))

R Squared (R2): R2 is also called the coefficient of determination or the goodness-of-fit score of the regression function.
It measures how much of the variance in the dependent variable the model can explain. The R2 value typically lies between
0 and 1 (it can be negative when the model fits worse than always predicting the mean), and a larger value shows a better
fit between predicted and actual values.

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print("R2:", r2)
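
To show what these metrics compute, here is a small sketch that evaluates all four by hand with NumPy on hypothetical actual and predicted values. The formulas follow the definitions given in this practical; for R2 the standard form 1 - SS_res / SS_tot is used.

```python
import numpy as np

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))      # mean of absolute errors
mse = np.mean(errors ** 2)         # mean of squared errors
rmse = np.sqrt(mse)                # square root of MSE

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2:", r2_manual)
```

For these values the errors are 1, 0, -1, and -2, so MAE is 1.0, MSE is 1.5, and R2 is 0.7.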

Signature:

Date: _________________________

Practical-7
To Implement Logistic Regression in Python.
Logistic regression is a statistical method for predicting binary classes. The outcome or target variable is dichotomous
in nature, which means there are only two possible classes. For example, it can be used for cancer detection
problems. It computes the probability of an event occurring.
It is closely related to linear regression, but the target variable is categorical. It models the log of odds as a linear
function of the inputs. Logistic Regression predicts the probability of occurrence of a binary event using the logit
function.
Linear Regression Equation:
y = β0 + β1X1 + β2X2 + … + βnXn
Where y is the dependent variable and X1, X2, …, Xn are explanatory variables.
Sigmoid Function:
p = 1 / (1 + e^(-y))
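
The sigmoid function maps any real number into the range (0, 1), which is why its output can be read as a probability. A minimal sketch is given below; the 0.5 cut-off used to turn probabilities into class labels is the usual convention and is an assumption here.

```python
import numpy as np

def sigmoid(z):
    """Map a real value (or array of values) into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Output of the linear equation for three hypothetical data points.
z = np.array([-4.0, 0.0, 4.0])
p = sigmoid(z)

# Convert probabilities to class labels with a 0.5 cut-off.
labels = (p >= 0.5).astype(int)

print("probabilities:", p)
print("labels:", labels)
```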

Acquisition of data

CSV file which tells which of the users purchased/not purchased a particular product.

Loading data

Visualizing and splitting the dataset

Logistic Regression Model

Training the Model
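
The steps listed above can be sketched with scikit-learn as follows. Because the CSV file is not named in this manual, the snippet generates synthetic stand-in data with two features (age and salary) and a made-up labelling rule. The feature names, the rule, and the split size are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the purchase data (hypothetical features).
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 60, size=n)
salary = rng.uniform(15000, 150000, size=n)
X = np.column_stack([age, salary])

# Made-up rule: older and better-paid users are labelled as purchasers (1).
y = ((age - 18) / 42 + (salary - 15000) / 135000 > 1.0).astype(int)

# Splitting the dataset into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scale the features; the scaler is fitted on the training set only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Training the model and predicting on the test set.
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
proba = classifier.predict_proba(X_test)   # one probability per class
accuracy = classifier.score(X_test, y_test)

print("Test accuracy:", accuracy)
```

With a real CSV file, the synthetic block would be replaced by a pd.read_csv call and a selection of the feature and target columns.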

Signature:

Date: _________________________

Practical-8
Implement KNN Algorithm for Classification.

Acquisition of data

CSV file which tells which of the users purchased/not purchased a particular product.

Loading data

Preprocessing

Splitting the dataset

Training
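
A minimal sketch of these steps with scikit-learn's KNeighborsClassifier is shown below. As the CSV file is not named in this manual, two synthetic clusters stand in for the "purchased" and "not purchased" classes. The cluster centres and the choice of k = 5 are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two clusters, one per class (hypothetical centres).
X, y = make_blobs(n_samples=200, centers=[[0, 0], [5, 5]],
                  cluster_std=1.0, random_state=0)

# Splitting the dataset into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Preprocessing: KNN is distance-based, so the features are scaled.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Training: each test point takes the majority class of its 5 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = knn.score(X_test, y_test)

print("Test accuracy:", accuracy)
```

The value of k is a tuning choice; small values follow the training data closely, while larger values give smoother decision boundaries.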

Signature:

Date: _________________________
