
PRACTICAL FILE FOR MACHINE LEARNING

Submitted by:
Ridham Kumar
B.Tech CSE, 7th Semester
17032100761
1.) Installation of Anaconda Navigator and introduction to various
tools/platforms.
Anaconda is a trusted suite that bundles Python and R distributions.
Anaconda is a package manager and virtual environment manager, and it
includes a set of pre-installed software packages. The Anaconda open-source
ecosystem is mainly used for data science, machine learning, and large-scale
data analysis. Anaconda is popular because it’s simple to install, and it
provides access to almost all the tools and packages that data professionals
require, including the following:

• the Python interpreter
• an extensive collection of packages
• Conda, a package and virtual environment management system
• Jupyter Notebook, a web-based interactive integrated development
environment (IDE) that combines code, text, and visualizations in the
same document
• Anaconda Navigator, a desktop application that makes it easy to
launch the software packages that come with the Anaconda distribution
and to manage packages and virtual environments without using
command-line commands
How to Install Anaconda

Anaconda is a cross-platform Python distribution that you can install on
Windows, macOS, or various distributions of Linux.

NOTE: If you already have Python installed, you don't need to uninstall it. You
can still go ahead and install Anaconda and use the Python version that
comes with the Anaconda distribution.

• Download the Anaconda installer for your operating system from
https://www.anaconda.com/downloads.
• Once the download is complete, double-click the package to start
installing Anaconda. The installer will walk you through a wizard to
complete the installation; the default settings work well in most cases.
• Click on Continue on the Introduction, Read Me, and License screens.
Click on Agree when the license prompt appears to continue the
installation.
• On the Destination Select screen, select “Install for me only.” It is
recommended to install Anaconda on the default path; to do so, click
on Install. If you would like to install Anaconda on a different location,
click on Change Install Location… and change the installation path.
• On the PyCharm IDE screen, click on Continue to install Anaconda
without the PyCharm IDE.
• After the installation completes, click on Close on the Summary screen
to close the installation wizard.
• There are two ways to verify your Anaconda installation: either locate
Anaconda Navigator in the installed applications on your computer and
double-click on its icon, or open a terminal (Anaconda Prompt on
Windows) and run conda list to see the installed packages. A quick
Python-level check is shown below.
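As an additional Python-level sanity check (a minimal sketch, assuming the default Anaconda distribution, which bundles NumPy, Pandas, and scikit-learn), you can open a Python prompt or a Jupyter Notebook cell and import a few of the bundled packages:

# Minimal check that the Anaconda-bundled packages are importable
# (the versions printed will vary with the installed distribution)
import sys
import numpy as np
import pandas as pd
import sklearn

print("Python :", sys.version.split()[0])
print("NumPy  :", np.__version__)
print("Pandas :", pd.__version__)
print("sklearn:", sklearn.__version__)

If all of the imports succeed, the installation is working.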

2.) Introduction to the NumPy, SciPy, scikit-learn, Pandas, Keras,
Matplotlib and TensorFlow packages in Python.

1. Pandas

Pandas is a BSD (Berkeley Software Distribution) licensed open-source
library. This popular library is widely used in the field of data science. It
is primarily used for data analysis, manipulation, cleaning, etc. Pandas
allows for simple data modelling and data analysis operations without the
need to switch to another language such as R. Pandas usually works with
the following types of data:

• Data in a dataset.
• Time series containing both ordered and unordered data.
• Matrix data with labelled rows and columns.
• Unlabelled data.
• Any other type of statistical data.

Pandas can do a wide range of tasks, including:

• The data frame can be sliced using Pandas.
• Data frame joining and merging can be done using Pandas.
• Columns from two data frames can be concatenated using Pandas.
• In a data frame, index values can be changed using Pandas.
• In a column, the headers can be changed using Pandas.
• Data conversion into various forms can also be done using Pandas, and
many more (a short example follows this list).
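The sketch below illustrates a few of these operations; the column names and values are made up purely for illustration:

import pandas as pd

# Two small data frames with made-up values
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "marks": [85, 90, 78]})

# Slicing a data frame
print(df1.iloc[0:2])

# Merging two data frames on a common column
merged = pd.merge(df1, df2, on="id")

# Concatenating a column from another data frame
combined = pd.concat([df1, df2["marks"]], axis=1)

# Changing index values and column headers
merged = merged.set_index("id")
merged = merged.rename(columns={"marks": "score"})
print(merged)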
2. NumPy

NumPy is one of the most widely used open-source Python libraries, focusing
on scientific computation. It features built-in mathematical functions for
quick computation and supports big matrices and multidimensional data.
The term “NumPy” stands for “Numerical Python.” It can be used in linear
algebra, as a multi-dimensional container for generic data, and as a random
number generator, among other things. Some of the important functions in
NumPy are arcsin(), arccos(), tan(), radians(), etc. NumPy Array is a Python
object which defines an N-dimensional array with rows and columns. In
Python, NumPy Array is preferred over lists because it takes up less memory
and is faster and more convenient to use.

Features:

1. Interactive: NumPy is a very interactive and user-friendly library.
2. Mathematics: NumPy simplifies the implementation of difficult
mathematical equations.
3. Intuitive: It makes coding and understanding topics a breeze.
4. Widely used: Because NumPy is so widely utilised, it attracts a lot of
open-source contribution.

The NumPy interface can be used to represent images, sound waves, and
other binary raw streams as an N-dimensional array of real values for
visualization. Knowledge of NumPy is required for full-stack developers to
use this library for machine learning.
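The sketch below illustrates these points with a small array; the numbers are arbitrary, and the exact memory sizes printed will vary by platform:

import sys
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])   # a 2-D NumPy array

# Built-in mathematical functions operate element-wise
print(np.radians(arr))             # degrees to radians
print(np.sin(np.radians(arr)))

# Random number generation
print(np.random.rand(2, 3))

# NumPy arrays are more memory-efficient than plain Python lists
lst = list(range(1000))
np_arr = np.arange(1000)
print("list bytes :", sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst))
print("array bytes:", np_arr.nbytes)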

3. Keras

Keras is a Python-based open-source neural network library that lets us
experiment with deep neural networks quickly. With deep learning becoming
more common, Keras emerges as a great option because, according to the
creators, it is an API (Application Programming Interface) designed for
humans, not machines. Keras has a higher adoption rate in the industry and
research community than TensorFlow or Theano. It is recommended that you
install the TensorFlow backend engine before installing Keras.
Features:

1. It runs without a hitch on both the CPU (Central Processing Unit) and
GPU (Graphics Processing Unit).
2. Keras supports nearly all neural network models, including fully
connected, convolutional, pooling, recurrent, embedding, and so forth.
These models can also be merged to create more sophisticated
models.
3. Keras’ modular design makes it very expressive, adaptable, and well
suited to cutting-edge research.
4. Keras is a Python-based framework, making it simple to debug and
explore different models and projects.

Keras-powered features are already in use at various companies, for
instance, Netflix, Uber, Yelp, Instacart, Zocdoc, Square, and a slew of other
companies. It is particularly popular among firms that use deep learning to
power their products. Keras includes a lot of implementations of standard
neural network building elements such as layers, objectives, activation
functions, optimizers, and a slew of other tools for working with picture and
text data. It also includes a number of pre-processed data sets and pre-
trained models, such as MNIST, VGG, Inception, SqueezeNet, ResNet, etc.
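A minimal sketch of a fully connected Keras model is shown below. It assumes TensorFlow is installed as the backend; the layer sizes and the 4-feature/3-class shapes are illustrative rather than tied to a specific dataset:

from tensorflow import keras
from tensorflow.keras import layers

# A small fully connected (dense) network for a 3-class problem
model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(3, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=50) would train it on one-hot encoded labels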

4. TensorFlow

TensorFlow is a high-performance, open-source numerical computation
library. It is also used in machine learning and deep learning algorithms. It
was created by researchers on the Google Brain team within the
Google AI organization and is currently widely utilized by math, physics, and
machine learning researchers for complicated mathematical computations.
TensorFlow is designed to be fast, and it employs techniques such as XLA
(XLA or Accelerated Linear Algebra is a domain-specific compiler for linear
algebra that can accelerate TensorFlow models with potentially no source
code changes.) to do speedy linear algebra computations.

Features:
• Responsive Construct: We can easily visualize each and every part of
the graph with TensorFlow, which is not possible with NumPy or scikit-learn.
• Adaptable: One of the most essential Tensorflow features is that it is
flexible in its operation related to Machine Learning models, which
means that it has modularity and allows you to make sections of it
stand alone.
• It is Simple to Train Machine Learning Models in TensorFlow: Machine
Learning models can be readily trained using TensorFlow on both the CPU
and GPU for distributed computing.
• Parallel Neural Network Training: TensorFlow allows you to train many
neural networks and GPUs at the same time.
• Open source and a large community: Since it was developed by Google,
there is already a significant team of software experts working on
constant stability improvements. The nicest part about this machine
learning library is that it is open source, which means that anyone with
internet access can use it.

TensorFlow is used on a regular basis, though often indirectly, through
services like Google Voice Search and Google Photos. TensorFlow’s libraries
are developed entirely in C and C++. It does, however, have a sophisticated
Python front end. Your Python code will be compiled and run on the
TensorFlow distributed execution engine, which is written in C and C++.
TensorFlow has an almost infinite number of applications, which is one of its
most appealing features.
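The sketch below shows two of TensorFlow's core building blocks, tensor operations and automatic differentiation with GradientTape; the values are arbitrary and only illustrate the API:

import tensorflow as tf

# Tensors and fast linear algebra
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])
print(tf.matmul(a, b))

# Automatic differentiation with GradientTape
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2 + 2.0 * x
print(tape.gradient(y, x))   # dy/dx = 2x + 2 = 8.0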

5. SciPy

Scipy is a free, open-source Python library used for scientific computing, data
processing, and high-performance computing. The library contains a huge
number of user-friendly routines for quick computation. The package is
based on the NumPy extension, which allows for data processing and
visualization as well as high-level commands. Scipy is used for mathematical
computations alongside NumPy. NumPy enables the sorting and indexing of
array data, while SciPy stores the numerical code. Cluster, constants, fftpack,
integrate, interpolate, io, linalg, ndimage, odr, optimize, signal, sparse,
spatial, special, and stats are only a few of the many sub packages available
in SciPy. “from scipy import subpackage-name” can be used to import them
from SciPy. NumPy, the SciPy library, Matplotlib, IPython, SymPy, and Pandas
are, however, the core packages of the wider SciPy ecosystem.

Features:

• SciPy’s key characteristic is that it is written on top of NumPy, and it
makes extensive use of NumPy arrays.
• SciPy uses its specialised submodules to provide all of the efficient
numerical algorithms such as optimization, numerical integration, and
many others.
• All functions in SciPy’s submodules are extensively documented.
SciPy’s primary data structure is NumPy arrays, and it includes
modules for a variety of popular scientific programming applications.
SciPy handles tasks like linear algebra, integration (calculus), solving
ordinary differential equations, and signal processing with ease.
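A short sketch using two of the submodules mentioned above, scipy.integrate and scipy.optimize; the function being integrated and minimised is an arbitrary example:

import numpy as np
from scipy import integrate, optimize

# Numerical integration: integrate sin(x) from 0 to pi (the exact answer is 2)
value, error = integrate.quad(np.sin, 0, np.pi)
print("integral:", value)

# Optimization: minimise (x - 3)^2 starting from x = 0
result = optimize.minimize(lambda x: (x - 3) ** 2, x0=0.0)
print("minimum at:", result.x)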

6.) Scikit Learn

Scikit Learn is an open-source library for machine learning algorithms that
runs in the Python environment. It can be used with both supervised and
unsupervised learning algorithms. The library includes popular algorithms and
builds on the NumPy, Matplotlib, and SciPy packages. Scikit-learn's most
well-known use is for music recommendations in Spotify. Let us now deep dive
into some of the key features of Scikit Learn:

• Cross-Validation: There are several methods for checking the accuracy
of supervised models on unseen data with Scikit Learn, for example
the train_test_split method, cross_val_score, etc.
• Unsupervised learning techniques: There is a wide range of
unsupervised learning algorithms available, ranging from clustering,
factor analysis, principal component analysis, and unsupervised neural
networks.
• Feature extraction: Extracting features from images and text is a useful
tool (e.g., bag of words).
Scikit Learn includes many algorithms and can be used for performing
common machine learning and data mining tasks such as dimensionality
reduction, classification, regression, clustering, and model selection.
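A minimal sketch of the cross-validation features described above, using the built-in iris dataset; the choice of dataset and classifier is purely illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold-out evaluation with train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))

# 5-fold cross-validation with cross_val_score
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("cross-validation accuracy:", scores.mean())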

7.) Matplotlib

Matplotlib is a cross-platform data visualization and graphical plotting
library for Python and its numerical extension NumPy. As such, it offers a
viable open source alternative to MATLAB. Developers can also use
matplotlib’s APIs (Application Programming Interfaces) to embed plots in GUI
applications.

A Python matplotlib script is structured so that a few lines of code are all that
is required in most instances to generate a visual data plot. The matplotlib
scripting layer overlays two APIs:

• The pyplot API is a hierarchy of Python code objects topped
by matplotlib.pyplot.

• An OO (Object-Oriented) API: a collection of objects that can be
assembled with greater flexibility than pyplot. This API provides direct
access to Matplotlib’s backend layers.

Matplotlib and Pyplot in Python

The pyplot API has a convenient MATLAB-style stateful interface. In fact,
matplotlib was originally written as an open source alternative for MATLAB.
The OO API and its interface are more customizable and powerful than pyplot,
but considered more difficult to use. As a result, the pyplot interface is more
commonly used. Understanding matplotlib’s pyplot API is key to
understanding how to work with plots:

• matplotlib.pyplot.figure: Figure is the top-level container. It includes
everything visualized in a plot, including one or more Axes.
• matplotlib.pyplot.axes: An Axes contains most of the elements in
a plot (Axis, Tick, Line2D, Text, etc.) and sets the coordinates. It is the
area in which data is plotted. Axes include the X-axis, Y-axis, and
possibly a Z-axis as well. A short example of both interfaces follows.
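The sketch below produces a simple plot through each interface, first the stateful pyplot API and then the object-oriented Figure/Axes API; the data is arbitrary:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

# pyplot (MATLAB-style, stateful) interface
plt.plot(x, np.sin(x))
plt.title("sin(x) via pyplot")
plt.show()

# Object-oriented interface: explicit Figure and Axes objects
fig, ax = plt.subplots()
ax.plot(x, np.cos(x))
ax.set_title("cos(x) via the OO API")
ax.set_xlabel("x")
ax.set_ylabel("cos(x)")
plt.show()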

3.) Build classification models using Bayes Net and Naïve Bayes
models for given datasets in Python. Find the classification accuracy of
these models using a confusion matrix.
Introduction to the Naive Bayes algorithm

In machine learning, Naïve Bayes classification is a straightforward and powerful
algorithm for the classification task. Naïve Bayes classification is based on applying Bayes’
theorem with strong independence assumption between the features. Naïve Bayes
classification produces good results when we use it for textual data analysis such as
Natural Language Processing.

Naïve Bayes models are also known as simple Bayes or independent Bayes. All these
names refer to the application of Bayes’ theorem in the classifier’s decision rule. Naïve
Bayes classifier applies the Bayes’ theorem in practice. This classifier brings the power of
Bayes’ theorem to machine learning.
Naïve Bayes Classifier uses the Bayes’ theorem to predict membership probabilities for
each class, such as the probability that a given record or data point belongs to a particular
class. The class with the highest probability is considered as the most likely class. This is
also known as the Maximum A Posteriori (MAP).
The MAP for a hypothesis A given evidence B is:

MAP(A) = max(P(A | B))
       = max(P(B | A) * P(A) / P(B))
       = max(P(B | A) * P(A))

Here, P(B) is the evidence probability. It is used to normalize the result. Since it is the same
for every class, removing it does not change which class has the highest probability.

Working of Naïve Bayes' Classifier can be understood with the help of the
below example:
Suppose we have a dataset of weather conditions and a corresponding target
variable "Play". Using this dataset, we need to decide whether we should play
or not on a particular day according to the weather conditions. To solve this
problem, we need to follow the steps below:

1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Now, use Bayes' theorem to calculate the posterior probability.

Problem: If the weather is sunny, then the Player should play or not?

Solution: To solve this, first consider the below dataset:

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes
11 Rainy No

12 Overcast Yes

13 Overcast Yes

Frequency table for the Weather Conditions:

Weather Yes No

Overcast 5 0

Rainy 2 2

Sunny 3 2

Total 10 4

Likelihood table for the weather conditions:

Weather     No    Yes    P(Weather)

Overcast    0     5      5/14 = 0.35

Rainy       2     2      4/14 = 0.29

Sunny       2     3      5/14 = 0.35

All         4/14 = 0.29   10/14 = 0.71


Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|No)= 2/4=0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
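In practice, the same reasoning is applied by scikit-learn's Naive Bayes classifiers. The sketch below is a minimal example, assuming a hypothetical dataset.csv whose last column is the class label; it shows how the y_test and y_pred arrays used in the next step are typically produced:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical file name; the last column is assumed to be the target class
data = pd.read_csv('dataset.csv')
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit a Gaussian Naive Bayes classifier and predict on the test set
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)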

Creating Confusion Matrix:

Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix.
Below is the code for it:

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Output:

As we can see in the above confusion matrix output, there are 7 + 3 = 10 incorrect predictions
and 65 + 25 = 90 correct predictions.
4.) Compare various models on the given dataset and explore the
concepts of overfitting and underfitting

Overfitting in Machine Learning

Overfitting refers to a model that models the training data too well.
Overfitting happens when a model learns the detail and noise in the training
data to the extent that it negatively impacts the performance of the model on
new data. This means that the noise or random fluctuations in the training
data are picked up and learned as concepts by the model. The problem is that
these concepts do not apply to new data and negatively impact the model's
ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models that have
more flexibility when learning a target function. As such, many
nonparametric machine learning algorithms also include parameters or
techniques to limit and constrain how much detail the model learns.
For example, decision trees are a nonparametric machine learning algorithm
that is very flexible and is subject to overfitting training data. This problem
can be addressed by pruning a tree after it has learned in order to remove
some of the detail it has picked up.
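As an illustration of this idea, a scikit-learn decision tree can be constrained by limiting its depth; the dataset and the depth value below are arbitrary choices for the sketch:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorise the training data
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Limiting max_depth constrains how much detail the tree can learn
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("full tree   train:", full_tree.score(X_train, y_train), "test:", full_tree.score(X_test, y_test))
print("pruned tree train:", pruned_tree.score(X_train, y_train), "test:", pruned_tree.score(X_test, y_test))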
Underfitting in Machine Learning
Underfitting refers to a model that can neither model the training data nor
generalize to new data.
An underfit machine learning model is not a suitable model and will be
obvious as it will have poor performance on the training data.
Underfitting is often not discussed as it is easy to detect given a good
performance metric. The remedy is to move on and try alternate machine
learning algorithms. Nevertheless, it does provide a good contrast to the
problem of overfitting.
A Good Fit in Machine Learning
Ideally, you want to select a model at the sweet spot between underfitting
and overfitting.

This is the goal, but is very difficult to do in practice.

To understand this goal, we can look at the performance of a machine
learning algorithm over time as it learns the training data. We can plot both
the skill on the training data and the skill on a test dataset we have held back
from the training process.
Over time, as the algorithm learns, the error for the model on the training
data goes down and so does the error on the test dataset. If we train for too
long, the error on the training dataset may continue to decrease
because the model is overfitting and learning the irrelevant detail and noise
in the training dataset. At the same time the error for the test set starts to rise
again as the model's ability to generalize decreases.
The sweet spot is the point just before the error on the test dataset starts to
increase where the model has good skill on both the training dataset and the
unseen test dataset.

You can perform this experiment with your favorite machine learning
algorithms. This is often not a useful technique in practice, because by
choosing the stopping point for training using the skill on the test dataset, it
means that the test set is no longer "unseen" or a standalone objective
measure. Some knowledge (a lot of useful knowledge) about that data has
leaked into the training procedure.
There are two additional techniques you can use to help find the sweet spot
in practice: resampling methods and a validation dataset.
How To Limit Overfitting
Both overfitting and underfitting can lead to poor model performance. But by
far the most common problem in applied machine learning is overfitting.
Overfitting is such a problem because the evaluation of machine learning
algorithms on training data is different from the evaluation we actually care
the most about, namely how well the algorithm performs on unseen data.
There are two important techniques that you can use when evaluating
machine learning algorithms to limit overfitting:
• Use a resampling technique to estimate model accuracy.
• Hold back a validation dataset.
The most popular resampling technique is k-fold cross validation. It allows
you to train and test your model k-times on different subsets of training data
and build up an estimate of the performance of a machine learning model on
unseen data. A validation dataset is simply a subset of your training data that
you hold back from your machine learning algorithms until the very end of
your project. After you have selected and tuned your machine learning
algorithms on your training dataset you can evaluate the learned models on
the validation dataset to get a final objective idea of how the models might
perform on unseen data. Using cross validation is a gold standard in applied
machine learning for estimating model accuracy on unseen data.
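A minimal sketch of both techniques, using the built-in iris data as a stand-in for the given dataset; the two models being compared are arbitrary examples:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hold back a validation set until the very end of the project
X_work, X_val, y_work, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Compare models with 5-fold cross-validation on the working data
models = {
    "kNN": KNeighborsClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X_work, y_work, cv=5)
    print(name, "cross-validation accuracy:", scores.mean())

# Final, objective check of the chosen model on the untouched validation set
best = LogisticRegression(max_iter=1000).fit(X_work, y_work)
print("validation accuracy:", best.score(X_val, y_val))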
5.) Hierarchical clustering algorithm to cluster data stored in a .csv
dataset

# Do the necessary imports

import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs

# Create a blob of 200 data points

dataset = make_blobs(n_samples = 200,

n_features = 2,

centers = 4,
cluster_std = 1.6,

random_state = 50)

# Keep only the data points (the first element of the returned tuple)

points = dataset[0]

# Import libraries for Hierarchical Clustering

import scipy.cluster.hierarchy as sch

from sklearn.cluster import AgglomerativeClustering

# Create a dendrogram

dendrogram = sch.dendrogram(sch.linkage(points, method = 'ward'))

[Output: dendrogram plot]
# Scatter plot to see what the data looks like
plt.scatter(dataset[0][:,0], dataset[0][:,1])

# Perform the Actual Clustering

hc = AgglomerativeClustering(n_clusters = 4, affinity = 'euclidean', linkage = 'ward')


y_hc = hc.fit_predict(points)

plt.scatter(points[y_hc == 0,0], points[y_hc == 0,1], s= 100, c='cyan')

plt.scatter(points[y_hc == 1,0], points[y_hc == 1,1], s= 100, c='yellow')


plt.scatter(points[y_hc == 2,0], points[y_hc == 2,1], s= 100, c='red')

plt.scatter(points[y_hc == 3,0], points[y_hc == 3,1], s= 100, c='green')

plt.show()
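Since this practical asks for data stored in a .csv file, the synthetic make_blobs data above can be replaced by points read with pandas; the file name and column names below are hypothetical:

import pandas as pd

# Hypothetical CSV with two numeric feature columns, e.g. 'x' and 'y'
df = pd.read_csv('clusterData.csv')
points = df[['x', 'y']].values

# The same dendrogram and AgglomerativeClustering steps above can then be applied to these points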

6.) Build an ANN model with the back-propagation neural network
approach on .csv datasets for a classification problem. Compute the
accuracy of the classifier using a test dataset.

# Import libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Load dataset: get features and target
# irisData.csv holds the feature columns; Result.csv is expected to hold
# one-hot encoded class labels (3 columns)
X = pd.read_csv('irisData.csv').to_numpy()
y = pd.read_csv('Result.csv').to_numpy()

# Split data into train and test data (hold out 20 samples for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=20,
random_state=4)

# Initialize variables
learning_rate = 0.1
iterations = 5000
N = y_train.size

# number of input features


input_size = 4
# number of hidden layers neurons
hidden_size = 2

# number of neurons at the output layer


output_size = 3

results = pd.DataFrame(columns=["mse", "accuracy"])


# Initialize weights
np.random.seed(10)

# initializing weight for the hidden layer


W1 = np.random.normal(scale=0.5, size=(input_size, hidden_size))

# initializing weight for the output layer


W2 = np.random.normal(scale=0.5, size=(hidden_size , output_size))
def sigmoid(x):
return 1 / (1 + np.exp(-x))

def mean_squared_error(y_pred, y_true):


return (((y_pred - y_true)**2).sum() / (2*y_pred.size))

def accuracy(y_pred, y_true):


acc = y_pred.argmax(axis=1) == y_true.argmax(axis=1)
return acc.mean()
for itr in range(iterations):

    # feedforward propagation
    # on hidden layer
    Z1 = np.dot(X_train, W1)
    A1 = sigmoid(Z1)

    # on output layer
    Z2 = np.dot(A1, W2)
    A2 = sigmoid(Z2)

    # Calculating error
    mse = mean_squared_error(A2, y_train)
    acc = accuracy(A2, y_train)
    # DataFrame.append was removed in recent pandas; pd.concat does the same job
    results = pd.concat([results, pd.DataFrame([{"mse": mse, "accuracy": acc}])],
                        ignore_index=True)

    # backpropagation
    # delta at the output layer
    E1 = A2 - y_train
    dW1 = E1 * A2 * (1 - A2)

    # error propagated back to the hidden layer
    E2 = np.dot(dW1, W2.T)
    dW2 = E2 * A1 * (1 - A1)

    # weight updates (gradients averaged over the N training samples)
    W2_update = np.dot(A1.T, dW1) / N
    W1_update = np.dot(X_train.T, dW2) / N

    W2 = W2 - learning_rate * W2_update
    W1 = W1 - learning_rate * W1_update
results.mse.plot(title="Mean Squared Error")
<AxesSubplot:title={'center':'Mean Squared Error'}>

results.accuracy.plot(title="Accuracy")

<AxesSubplot:title={'center':'Accuracy'}>

# feedforward
Z1 = np.dot(X_test, W1)
A1 = sigmoid(Z1)

Z2 = np.dot(A1, W2)
A2 = sigmoid(Z2)

acc = accuracy(A2, y_test)


print("Accuracy: {}".format(acc))
Accuracy: 0.8
