Sec-D ML Practical File
MACHINE LEARNING
Submitted by:
Ridham Kumar
B.Tech CSE, 7th Semester
17032100761
1.) Installation of Anaconda Navigator and introduction to various
tools/platforms
Anaconda is a trusted suite that bundles Python and R distributions. It acts
as a package manager and virtual-environment manager and ships with a set of
pre-installed software packages. The Anaconda open-source ecosystem is used
mainly for data science, machine learning, and large-scale data analysis.
Anaconda is popular because it is simple to install and provides access to
almost all the tools and packages that data professionals require, including
the following:
1. Pandas
Pandas is a Python library for data manipulation and analysis. It can handle
many kinds of data, such as:
• Tabular data, as in a dataset with rows and columns.
• Time-series data, both ordered and unordered.
• Matrix data with labelled rows and columns.
• Unlabelled data.
• Any other form of observational or statistical data.
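As a brief illustration, the snippet below builds a small labelled table; the
column names and values are made up for the example:

import pandas as pd

# A small labelled table (hypothetical values)
df = pd.DataFrame({
    "outlook": ["Sunny", "Rainy", "Overcast"],
    "play": ["Yes", "No", "Yes"],
})
print(df.head())                   # first rows of the table
print(df["play"].value_counts())   # frequency of each label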
2. NumPy
NumPy is one of the most widely used open-source Python libraries, focused on
scientific computation. It provides built-in mathematical functions for fast
computation and supports large matrices and multidimensional data. The name
NumPy stands for "Numerical Python". It can be used for linear algebra, as a
multi-dimensional container for generic data, and as a random number
generator, among other things. Some of the important functions in NumPy are
arcsin(), arccos(), tan(), radians(), etc. A NumPy array is a Python object
that represents an N-dimensional array with rows and columns. In Python,
NumPy arrays are preferred over lists because they take up less memory and
are faster and more convenient to use.
Features:
The NumPy interface can be used to represent images, sound waves, and other
binary raw streams as N-dimensional arrays of real values for visualization.
Working knowledge of NumPy is expected of full-stack developers who apply the
library to machine learning.
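A short sketch of the array behaviour described above; the array contents are
arbitrary:

import numpy as np

# A 2-D array (matrix) with 2 rows and 3 columns
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(a.shape)          # (2, 3)
print(a.mean(axis=0))   # column means: [2.5 3.5 4.5]
print(np.radians(180))  # built-in math function: pi
print(np.arcsin(1.0))   # inverse sine: pi/2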
3. Keras
Keras is a high-level neural network library written in Python. Its main
strengths are:
1. It runs without a hitch on both the CPU (Central Processing Unit) and
the GPU (Graphics Processing Unit).
2. Keras supports nearly all neural network models, including fully
connected, convolutional, pooling, recurrent, embedding, and so forth.
These models can also be combined to create more sophisticated models.
3. Keras' modular design makes it very expressive, adaptable, and well
suited to cutting-edge research.
4. Keras is a Python-based framework, which makes it simple to debug and
to explore different models and projects.
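A minimal sketch of a fully connected model in Keras; the layer sizes and the
input dimension are arbitrary choices for illustration:

from tensorflow import keras
from tensorflow.keras import layers

# A small fully connected (dense) network for binary classification
model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(16, activation="relu"),     # hidden layer
    layers.Dense(1, activation="sigmoid"),   # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()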
4. TensorFlow
TensorFlow is an open-source library, developed by Google, for numerical
computation and large-scale machine learning.
Features:
• Responsive construct: TensorFlow makes it easy to visualize each and every
part of the computation graph, which is not possible with NumPy or
scikit-learn.
• Adaptable: one of the most essential TensorFlow features is that it is
flexible in how machine learning models are built; it is modular, and it
allows you to make sections of a model stand alone.
• Simple to train: machine learning models can be readily trained using
TensorFlow on both the CPU and the GPU, including for distributed computing.
• Parallel neural network training: TensorFlow allows you to train many
neural networks, across multiple GPUs, at the same time.
• Open source and a large community: since it was developed by Google, a
significant team of software experts works on constant stability
improvements. The nicest part about this machine learning library is that it
is open source, which means that anyone with internet access can use it.
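A tiny sketch of TensorFlow's numerical core, tensors with automatic
differentiation; the values are arbitrary:

import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2 + 2.0 * x      # y = x^2 + 2x
grad = tape.gradient(y, x)    # dy/dx = 2x + 2, which is 8 at x = 3
print(grad.numpy())           # 8.0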
5. SciPy
SciPy is a free, open-source Python library used for scientific computing,
data processing, and high-performance computing. The library contains a huge
number of user-friendly routines for fast computation. The package is built
on the NumPy extension and adds high-level commands for data processing and
visualization. SciPy is used for mathematical computation alongside NumPy:
NumPy provides the sorting and indexing of array data, while the numerical
algorithms live in SciPy. cluster, constants, fftpack, integrate,
interpolate, io, linalg, ndimage, odr, optimize, signal, sparse, spatial,
special, and stats are only a few of the many subpackages available in SciPy.
They can be imported with "from scipy import subpackage-name". NumPy, the
SciPy library, Matplotlib, IPython, SymPy, and Pandas together form the core
packages of the wider SciPy ecosystem.
Features:
• SciPy's key characteristic is that it is written on top of NumPy, and its
code makes extensive use of NumPy arrays.
• Through its specialised submodules, SciPy provides efficient numerical
algorithms such as optimization, numerical integration, and many others.
• All functions in SciPy's submodules are extensively documented.
SciPy's primary data structure is the NumPy array, and it includes modules
for a variety of popular scientific programming tasks. SciPy handles tasks
like linear algebra, integration (calculus), solving ordinary differential
equations, and signal processing with ease.
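For instance, numerical integration with the integrate subpackage (the
integrand here is arbitrary):

import numpy as np
from scipy import integrate

# Integrate sin(x) from 0 to pi; the exact answer is 2
value, abs_error = integrate.quad(np.sin, 0, np.pi)
print(value)   # 2.0 up to numerical error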
6. Matplotlib
Matplotlib is Python's standard plotting library. A matplotlib script is
structured so that, in most cases, a few lines of code are all that is
required to generate a visual data plot. The matplotlib scripting layer
overlays two APIs:
• The pyplot API: a state-machine interface of command-style functions
(plt.plot(), plt.title(), and so on) that act on the "current" figure and
axes.
• The object-oriented API: Figure and Axes objects are created and
manipulated explicitly, which gives finer control and suits larger programs.
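A minimal example of each style; the data points are made up:

import matplotlib.pyplot as plt

xs = [0, 1, 2, 3]
ys = [0, 1, 4, 9]

# pyplot (state-machine) style
plt.plot(xs, ys)
plt.title("pyplot style")
plt.show()

# object-oriented style
fig, ax = plt.subplots()
ax.plot(xs, ys)
ax.set_title("object-oriented style")
plt.show()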
3.) Implementation of the Naïve Bayes classifier
Naïve Bayes models are also known as simple Bayes or independent Bayes. All these
names refer to the application of Bayes' theorem in the classifier's decision
rule: in practice, the Naïve Bayes classifier applies Bayes' theorem to
prediction, bringing its power to machine learning.
The Naïve Bayes classifier uses Bayes' theorem to predict membership
probabilities for each class, such as the probability that a given record or
data point belongs to a particular class. The class with the highest
probability is considered the most likely class. This is also known as the
Maximum A Posteriori (MAP) decision rule.
For a hypothesis with two events A and B:
MAP(A) = max(P(A | B)) = max(P(B | A) * P(A) / P(B))
Here, P(B) is the evidence probability, used to normalize the result. It is
the same for every hypothesis, so removing it does not change which class
wins:
MAP(A) = max(P(B | A) * P(A))
The working of the Naïve Bayes classifier can be understood with the help of
the example below.
Suppose we have a dataset of weather conditions and a corresponding target
variable "Play". Using this dataset, we need to decide whether we should play
or not on a particular day according to the weather conditions. To solve this
problem, we first convert the dataset into frequency and likelihood tables
and then apply Bayes' theorem.
Problem: if the weather is sunny, should the player play or not?
     Outlook     Play
0    Rainy       Yes
1    Sunny       Yes
2    Overcast    Yes
3    Overcast    Yes
4    Sunny       No
5    Rainy       Yes
6    Sunny       Yes
7    Overcast    Yes
8    Rainy       No
9    Sunny       No
10   Sunny       Yes
11   Rainy       No
12   Overcast    Yes
13   Overcast    Yes
Frequency table of the weather conditions:

Weather     Yes   No
Overcast    5     0
Rainy       2     2
Sunny       3     2
Total       10    4
Likelihood table:

Weather     No            Yes           P(Weather)
Overcast    0             5             5/14 = 0.35
Rainy       2             2             4/14 = 0.29
Sunny       2             3             5/14 = 0.35
All         4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Yes) = 10/14 = 0.71
P(Sunny) = 5/14 = 0.35
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 4/14 = 0.29
So P(No|Sunny) = 0.50 * 0.29 / 0.35 = 0.41

Since P(Yes|Sunny) > P(No|Sunny), on a sunny day the player should play.
Now we will check the accuracy of the Naïve Bayes classifier using a
confusion matrix. Below is the code for it:
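A minimal sketch with scikit-learn, using a synthetic dataset in place of the
original one; the data, split size, and random seed here are assumptions for
illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

# Synthetic two-class data standing in for the original dataset
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Fit a Gaussian Naïve Bayes classifier and predict on the held-out test set
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))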
Output:
As we can see in the above confusion matrix output, there are 7 + 3 = 10
incorrect predictions and 65 + 25 = 90 correct predictions.
4.) Comparing various models on the given dataset and exploring the
concepts of overfitting and underfitting
You can perform this experiment with your favorite machine learning
algorithms: evaluate the model on both the training set and the test set as
training proceeds, and note where the two skill curves start to diverge. This
is often not a useful technique in practice, because by choosing the stopping
point for training using skill on the test dataset, the test set is no longer
"unseen" or a standalone objective measure. Some knowledge (a lot of useful
knowledge) about that data has leaked into the training procedure.
There are two additional techniques you can use to help find the sweet spot
in practice: resampling methods and a validation dataset.
How to Limit Overfitting
Both overfitting and underfitting can lead to poor model performance, but by
far the most common problem in applied machine learning is overfitting.
Overfitting is such a problem because the evaluation of machine learning
algorithms on training data differs from the evaluation we actually care most
about, namely how well the algorithm performs on unseen data.
There are two important techniques that you can use when evaluating machine
learning algorithms to limit overfitting:
• Use a resampling technique to estimate model accuracy.
• Hold back a validation dataset.
The most popular resampling technique is k-fold cross-validation. It allows
you to train and test your model k times on different subsets of the training
data and build up an estimate of the performance of the model on unseen data.
A validation dataset is simply a subset of your training data that you hold
back from your machine learning algorithms until the very end of your
project. After you have selected and tuned your machine learning algorithms
on the training dataset, you can evaluate the learned models on the
validation dataset to get a final, objective idea of how they might perform
on unseen data. Using cross-validation is a gold standard in applied machine
learning for estimating model accuracy on unseen data.
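As a sketch of how such a comparison can be run with k-fold cross-validation
in scikit-learn (the dataset and the two candidate models below are arbitrary
choices for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

# Score each candidate model with 5-fold cross-validation
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree (unpruned)", DecisionTreeClassifier(random_state=1))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "mean accuracy:", scores.mean())

# An unpruned decision tree fits its training folds perfectly yet often
# scores lower on the held-out folds than the simpler model: a symptom of
# overfitting.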
5.) Hierarchical clustering algorithm to cluster data stored in a .csv
dataset
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

# Generate sample blob data to cluster
# (n_samples is assumed; the original value was not preserved)
dataset = make_blobs(n_samples=200,
                     n_features=2,
                     centers=4,
                     cluster_std=1.6,
                     random_state=50)
points = dataset[0]

# Create a dendrogram using Ward linkage
dendrogram = sch.dendrogram(sch.linkage(points, method='ward'))
plt.show()

# Scatter plot to see what the data looks like
plt.scatter(points[:, 0], points[:, 1])
plt.show()
6.) Training a neural network with feedforward and backpropagation
The code below is a runnable reconstruction of the experiment; the synthetic
dataset, the hidden-layer width, and the random seed are assumptions made for
illustration.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical binary-classification data (the original dataset was not shown)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
y_train, y_test = y_train.reshape(-1, 1), y_test.reshape(-1, 1)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def mean_squared_error(pred, target):
    return np.mean((pred - target) ** 2)

def accuracy(pred, target):
    return np.mean((pred >= 0.5) == target)

# Initialize variables
learning_rate = 0.1
iterations = 5000
N = y_train.size
hidden_size = 2  # assumed hidden-layer width
np.random.seed(10)
W1 = np.random.normal(scale=0.5, size=(X_train.shape[1], hidden_size))
W2 = np.random.normal(scale=0.5, size=(hidden_size, 1))

mse_history, acc_history = [], []
for itr in range(iterations):
    # feedforward propagation: hidden layer, then output layer
    Z1 = np.dot(X_train, W1)
    A1 = sigmoid(Z1)
    Z2 = np.dot(A1, W2)
    A2 = sigmoid(Z2)
    # record the error and accuracy on the training set
    mse_history.append(mean_squared_error(A2, y_train))
    acc_history.append(accuracy(A2, y_train))
    # backpropagation: error deltas at the output and hidden layers
    E1 = A2 - y_train
    dW1 = E1 * A2 * (1 - A2)
    E2 = np.dot(dW1, W2.T)
    dW2 = E2 * A1 * (1 - A1)
    # weight updates by gradient descent
    W2 = W2 - learning_rate * np.dot(A1.T, dW1) / N
    W1 = W1 - learning_rate * np.dot(X_train.T, dW2) / N

results = pd.DataFrame({"mse": mse_history, "accuracy": acc_history})
results.mse.plot(title="Mean Squared Error")
plt.show()
results.accuracy.plot(title="Accuracy")
plt.show()
Output: line plots of the mean squared error and the accuracy over the
training iterations.
# feedforward on the test set with the trained weights
Z1 = np.dot(X_test, W1)
A1 = sigmoid(Z1)
Z2 = np.dot(A1, W2)
A2 = sigmoid(Z2)
print("Test accuracy:", accuracy(A2, y_test))