ML Aml Cse It Lab Manual Final
SEMESTER: 7
LAB MANUAL
Machine Learning 3170724
Name:
Enrollment No:
Batch:
CERTIFICATE
Examiner
Faculty of Degree Engineering – 083
Department of Computer Science & Engineering - 31
INDEX
Machine Learning, as the name suggests, is the science of programming computers so that they are able to learn
from different kinds of data. A more general definition given by Arthur Samuel is: “Machine Learning is the field of
study that gives computers the ability to learn without being explicitly programmed.” It is typically used to solve
various types of real-life problems.
In the past, people performed Machine Learning tasks by manually coding all the algorithms and the
mathematical and statistical formulas. This made the process time-consuming, tedious, and inefficient. Today,
it has become much easier and more efficient thanks to various Python libraries, frameworks, and modules.
Python is one of the most popular programming languages for this task and has replaced many languages in the
industry; one of the reasons is its vast collection of libraries. Python libraries that are used in Machine
Learning are:
1. NumPy
2. SciPy
3. Scikit-learn
4. TensorFlow
5. Keras
6. PyTorch
7. Pandas
8. Matplotlib
NumPy
NumPy is a very popular Python library for processing large multi-dimensional arrays and matrices, and it comes
with a large collection of high-level mathematical functions. It is very useful for fundamental scientific computations in
Machine Learning. It is particularly useful for linear algebra, Fourier transforms, and random number generation.
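The three capabilities named above can be tried out in a few lines. This is only a minimal sketch; the matrix, the signal, and the seed are made-up values chosen for illustration.

```python
import numpy as np

# Linear algebra: solve the system A @ x = b
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)

# Fourier transform: a sine wave with 2 cycles over 8 samples
# has its energy in frequency bin 2
t = np.arange(8)
signal = np.sin(2 * np.pi * 2 * t / 8)
spectrum = np.abs(np.fft.fft(signal))

# Random numbers: a seeded generator gives reproducible samples
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=5)

print("Solution of A @ x = b:", x)
print("Magnitude spectrum:", spectrum.round(2))
print("Random samples:", samples)
```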
SciPy
SciPy is a very popular library among Machine Learning enthusiasts as it contains different modules for optimization,
linear algebra, integration and statistics. There is a difference between the SciPy library and the SciPy stack: the
SciPy library is one of the core packages that make up the SciPy stack. SciPy is also very useful for image manipulation.
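As a rough illustration of the modules mentioned above, the sketch below calls one routine each from scipy.integrate, scipy.optimize, and scipy.stats. The functions being integrated and minimized are arbitrary examples, not part of this manual's practicals.

```python
from scipy import integrate, optimize, stats

# Integration: area under x^2 between 0 and 3 (the exact value is 9)
area, abs_error = integrate.quad(lambda x: x ** 2, 0, 3)

# Optimization: the minimum of (x - 2)^2 + 1 lies at x = 2
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2 + 1)

# Statistics: Pearson correlation of two perfectly linear series
r, p_value = stats.pearsonr([1, 2, 3, 4], [2, 4, 6, 8])

print("Integral:", area)
print("Minimizer:", result.x)
print("Correlation:", r)
```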
Scikit-learn
TensorFlow
TensorFlow is a very popular open-source library for high-performance numerical computation developed by the
Google Brain team at Google. As the name suggests, TensorFlow is a framework for defining and running
computations involving tensors. It can train and run deep neural networks that can be used to develop several AI
applications. TensorFlow is widely used in the field of deep learning research and application.
Keras
Keras is a very popular Machine Learning library for Python. It is a high-level neural networks API capable of
running on top of TensorFlow, CNTK, or Theano. It can run seamlessly on both CPU and GPU. Keras makes it really
easy for ML beginners to build and design a neural network. One of the best things about Keras is that it allows for
easy and fast prototyping.
PyTorch
PyTorch is a popular open-source Machine Learning library for Python based on Torch, which is an open-source
Machine Learning library that is implemented in C with a wrapper in Lua. It has an extensive choice of tools and
libraries that support Computer Vision, Natural Language Processing (NLP), and many more ML programs. It allows
developers to perform computations on tensors with GPU acceleration and also helps in creating computational
graphs.
Pandas
Pandas is a popular Python library for data analysis. It is not directly related to Machine Learning, but a dataset
must be prepared before training, and this is where Pandas comes in handy. It provides many built-in methods for
grouping, combining, and filtering data.
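A small sketch of the kind of data preparation Pandas is used for: grouping, combining, and filtering. The two tables and their column names are hypothetical toy data, used only for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
sales = pd.DataFrame({
    "city": ["Pune", "Surat", "Pune", "Surat"],
    "units": [10, 4, 6, 8],
})
regions = pd.DataFrame({
    "city": ["Pune", "Surat"],
    "state": ["Maharashtra", "Gujarat"],
})

# Grouping: total units per city
units_per_city = sales.groupby("city")["units"].sum()

# Combining: attach the state of each city to every sales row
combined = sales.merge(regions, on="city", how="left")

# Filtering: keep only the rows with more than 5 units
large_orders = combined[combined["units"] > 5]

print(units_per_city)
print(large_orders)
```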
Matplotlib
Matplotlib is a very popular Python library for data visualization. Like Pandas, it is not directly related to Machine
Learning. It particularly comes in handy when a programmer wants to visualize the patterns in the data. It is a 2D
plotting library used for creating 2D graphs and plots. A module named pyplot makes plotting easy for programmers,
as it provides features to control line styles, font properties, axis formatting, etc. It provides various kinds of
graphs and plots for data visualization, viz., histograms, error charts, bar charts, etc.
Signature:
Date: _________________________
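import pandas as pd
# Load the dataset
disease_df = pd.read_csv("/framingham.csv")
# Drop the 'education' column, which is not needed
disease_df.drop(['education'], inplace=True, axis=1)
# Rename the 'male' column to 'Sex_male'
disease_df.rename(columns={'male': 'Sex_male'}, inplace=True)
# Show the first five rows
print(disease_df.head())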
Output:
Signature:
Date: _________________________
When you have a dataset, the first step is to check which columns have missing data and how many values are missing.
Let us use the most famous dataset among data science learners, the Titanic survivors dataset. Read the dataset using
the pandas read_csv function as shown below.
import pandas as pd
# Load the dataset
df = pd.read_csv("/content/titanic_data.csv")
# Show the first five rows
print(df.head())
How to check which columns have missing data, and how many?
The isnull() function is used for this. When you call sum() on the result of isnull(), the output is the total number
of missing values in each column.
missing_values=df.isnull().sum()
print(missing_values)
Although we know how many values are missing in each column, it is essential to know the percentage of them
against the total values. So, let us calculate that in a single line of code.
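One possible way to write that single line is to take the mean of isnull(), since the mean of a True/False column is the fraction of True values. The sketch below runs it on a tiny made-up table (toy_df is a hypothetical stand-in for the Titanic data) so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data, used only for illustration
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Percentage of missing values in each column
missing_percent = toy_df.isnull().mean() * 100
print(missing_percent)
```

On the real data the same expression would be applied to df instead of toy_df.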
It is a simple method, where we drop all the rows that have a missing value in a particular column. As
easy as this is, it comes with a huge disadvantage: you might end up losing a large chunk of your data. This will
reduce the size of your dataset and can make your model predictions biased. You should use this only when the
number of missing values is very small.
For example, the ‘Embarked’ column has just 2 missing values. So, we can drop the rows where this column is missing.
Follow the below code snippet.
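A minimal sketch of this step uses dropna with the subset argument, so that only missing values in the chosen column cause a row to be dropped. The small table below is hypothetical; on the Titanic data the same call would be made on df.

```python
import pandas as pd

# Hypothetical toy table with two missing 'Embarked' values
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Embarked": ["S", None, "C", None],
})

rows_before = len(toy_df)
# Drop only the rows where 'Embarked' is missing
toy_df = toy_df.dropna(subset=["Embarked"])
rows_after = len(toy_df)

print("Rows before:", rows_before, "Rows after:", rows_after)
```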
When a continuous variable column has missing values, you can calculate the mean of the non-null values and use it
to fill the vacancies.
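import numpy as np
# Fill the missing ages with the mean of the non-null values
df['Age'] = df['Age'].replace(np.nan, df['Age'].mean())
df['Age'][:10]
# Alternatively, fill with the median (apply this instead of the mean fill above, not after it)
df['Age'] = df['Age'].replace(np.nan, df['Age'].median())
df['Age'][:10]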
Signature:
Date: _________________________
DR.SUBHASH TECHNICAL CAMPUS Page 6
Practical-4
To deal with Outliers for data pre-processing.
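# Importing
import sklearn
from sklearn.datasets import load_diabetes
import pandas as pd
import matplotlib.pyplot as plt
# Load the diabetes dataset into a DataFrame (df_diabetics is used by the plots below)
diabetics = load_diabetes()
df_diabetics = pd.DataFrame(diabetics.data, columns=diabetics.feature_names)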
A box plot captures the summary of the data effectively and efficiently with only a simple box and
whiskers. It summarizes sample data using the 25th, 50th, and 75th percentiles. One can get insights (quartiles,
median, and outliers) into the dataset just by looking at its box plot.
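The numbers behind a box plot can also be computed directly. The sketch below finds the three quartiles of a made-up sample and flags points that fall more than 1.5 times the interquartile range beyond the box, which is the usual whisker convention; the 1.5 factor and the sample values are assumptions for illustration.

```python
import numpy as np

# Hypothetical sample with one obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# 25th, 50th and 75th percentiles (the box and the median line)
q1, median, q3 = np.percentile(values, [25, 50, 75])

# Interquartile range and the conventional 1.5 * IQR fences
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn as outliers on a box plot
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1:", q1, "Median:", median, "Q3:", q3)
print("Outliers:", outliers)
```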
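# Box Plot
import seaborn as sns
sns.boxplot(x=df_diabetics['bmi'])
plt.show()

# Remove outliers and plot again (function name and the "<= threshold" filter are assumptions)
def removal_box_plot(df, column, threshold):
    removed_outliers = df[df[column] <= threshold]
    sns.boxplot(x=removed_outliers[column])
    plt.title(f'Box Plot without Outliers of {column}')
    plt.show()
    return removed_outliers

threshold_value = 0.12
no_outliers = removal_box_plot(df_diabetics, 'bmi', threshold_value)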
Z-score
The Z-score is also called a standard score. It tells us how far a data point is from the mean, measured in
standard deviations. After setting a threshold value, one can use the Z-scores of the data points to define the outliers.
import numpy as np
threshold_z = 2
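A minimal sketch of Z-score based outlier detection with the threshold of 2 used above. The sample values are made up for illustration; on the diabetes data the same steps would be applied to a column such as 'bmi'.

```python
import numpy as np

# Hypothetical sample with one value far from the rest
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 50.0])
threshold_z = 2

# Z-score: distance from the mean in units of standard deviation
z_scores = (values - values.mean()) / values.std()

# A point is treated as an outlier when |z| exceeds the threshold
outliers = values[np.abs(z_scores) > threshold_z]
cleaned = values[np.abs(z_scores) <= threshold_z]

print("Z-scores:", z_scores.round(2))
print("Outliers:", outliers)
```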
Machine Learning is a branch of Artificial Intelligence that focuses on the development of algorithms and statistical
models that can learn from and make predictions on data. Linear regression is a type of machine-learning algorithm,
more specifically a supervised machine-learning algorithm, that learns from labelled datasets and maps the data points
to the most optimized linear function, which can then be used for prediction on new datasets.
First of all, we should know what a supervised machine learning algorithm is. It is a type of machine learning where the
algorithm learns from labelled data. Labelled data means a dataset whose respective target values are already known.
Supervised learning has two types:
Classification: It predicts the class of the data based on the independent input variables. A class is a categorical or
discrete value, for example whether the image of an animal shows a cat or a dog.
Regression: It predicts a continuous output variable based on the independent input variables, for example the prediction
of house prices based on different parameters like house age, distance from the main road, location, area, etc.
This is the simplest form of linear regression, and it involves only one independent variable and one dependent
variable. The equation for simple linear regression is:
Y = β0 + β1X
where:
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope
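import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
# Load the salary dataset
data_set= pd.read_csv('/content/salary_data.csv')
# Independent variable: every column except the last one
x= data_set.iloc[:, :-1].values
# Dependent variable: the column at index 1
y= data_set.iloc[:, 1].values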
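To see how β0 and β1 in the equation above can be estimated, the sketch below applies the standard least-squares formulas with NumPy. The experience and salary numbers are made up for illustration and are not taken from salary_data.csv.

```python
import numpy as np

# Hypothetical data: years of experience (X) and salary in thousands (Y)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([35.0, 40.0, 45.0, 50.0, 55.0])

# Least-squares estimate of the slope:
# beta1 = sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

# Least-squares estimate of the intercept: beta0 = mean(Y) - beta1 * mean(X)
beta0 = Y.mean() - beta1 * X.mean()

# Predict Y for a new value of X using Y = beta0 + beta1 * X
predicted = beta0 + beta1 * 6.0

print("Intercept:", beta0, "Slope:", beta1)
print("Prediction for X = 6:", predicted)
```

In practice a library class such as LinearRegression from sklearn.linear_model would normally be used for this fit.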
Signature:
Date: _________________________
There are various metrics that we can use to evaluate linear regression models. Since no model is 100 percent
accurate, evaluating the model on different metrics can help us optimize its performance, fine-tune it, and obtain better
results. The metrics we can use include:
Mean Absolute Error (MAE) calculates the absolute difference between the actual and predicted values. We take the sum
of all the absolute prediction errors and divide it by the total number of data points.
Mean Squared Error (MSE) is the most used metric. It finds the squared difference between actual and predicted values.
We take the sum of the squares of all prediction errors and divide it by the number of data points.
To get the MSE from the model, import the mean_squared_error function from the sklearn.metrics module.
import numpy as np
from sklearn.metrics import mean_squared_error
print("MSE", mean_squared_error(y_test, y_pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, y_pred)))
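The definitions of MAE, MSE, and RMSE can be checked by hand on a few numbers. This is a sketch with made-up actual and predicted values (y_true and y_hat are hypothetical names), using only NumPy.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

# Prediction errors: [1, 0, -1, -2]
errors = y_true - y_hat

# MAE: mean of the absolute errors = (1 + 0 + 1 + 2) / 4
mae = np.mean(np.abs(errors))

# MSE: mean of the squared errors = (1 + 0 + 1 + 4) / 4
mse = np.mean(errors ** 2)

# RMSE: square root of the MSE
rmse = np.sqrt(mse)

print("MAE", mae)
print("MSE", mse)
print("RMSE", rmse)
```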
Signature:
Date: _________________________
Where y is the dependent variable and x1, x2, ..., xn are the explanatory variables.
Sigmoid Function:
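The standard sigmoid (logistic) function is sigmoid(z) = 1 / (1 + e^(-z)); it maps any real number to a value between 0 and 1, which can be read as a probability. The sketch below implements it with NumPy. The input values and the 0.5 cut-off for turning a probability into a class are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values of the linear combination of the inputs
z = np.array([-4.0, 0.0, 4.0])
probabilities = sigmoid(z)

# Assumed decision rule: class 1 when the probability is at least 0.5
predicted_class = (probabilities >= 0.5).astype(int)

print("Probabilities:", probabilities.round(3))
print("Predicted classes:", predicted_class)
```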
Acquisition of data
CSV file which tells which of the users purchased/not purchased a particular product.
Loading data
Date: _________________________
Preprocessing
Training
Date: _________________________