AD8552 MACHINE LEARNING
Lecture Notes – Unit 5
UNIT V APPLICATIONS OF MACHINE LEARNING 8
Image Recognition – Speech Recognition – Email spam and Malware Filtering – Online fraud detection – Medical Diagnosis
A popular dataset that we haven't talked much about yet is the Olivetti face
dataset. The Olivetti face dataset was collected in the early 1990s by AT&T Laboratories Cambridge.
The dataset comprises facial images of 40 distinct subjects, taken at different times and
under different lighting conditions. In addition, subjects varied their facial expression
(open/closed eyes, smiling/not smiling) and their facial details (glasses/no glasses).
Images were then quantized to 256 grayscale levels and stored as unsigned
8-bit integers. Because there are 40 distinct subjects, the dataset comes with 40 distinct
target labels. Recognizing faces thus constitutes an example of a multiclass classification
task.
To get a sense of the dataset, plot some example images. Let's pick eight indices
from the dataset in a random order:
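The notes do not include the original code; a minimal sketch using scikit-learn's fetch_olivetti_faces (the random seed and variable names are illustrative assumptions) might look like this:

    import numpy as np
    from sklearn.datasets import fetch_olivetti_faces

    # load the 400 images (40 subjects x 10 images), each flattened to 4096 values
    dataset = fetch_olivetti_faces()
    X, y = dataset.data, dataset.target

    # pick eight example images in a random order
    rng = np.random.RandomState(21)         # fixed seed, chosen arbitrarily
    idx_rand = rng.randint(len(X), size=8)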
We can plot these example images using Matplotlib, but we need to make sure we
reshape the column vectors to 64 x 64 pixel images before plotting:
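Continuing the sketch above, one possible plotting snippet is:

    import matplotlib.pyplot as plt

    plt.figure(figsize=(14, 8))
    for p, i in enumerate(idx_rand):
        plt.subplot(2, 4, p + 1)
        plt.imshow(X[i].reshape((64, 64)), cmap='gray')  # column vector back to a 64 x 64 image
        plt.axis('off')
    plt.show()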
You can see how all the faces are taken against a dark background and are upright. The
facial expression varies drastically from image to image, making this an interesting
classification problem.
Before we can pass the dataset to the classifier, we need to preprocess it. Specifically, we
want to make sure that all example images have the same mean grayscale level:
We repeat this procedure for every image to make sure the feature values of every data point
(that is, a row in X) are centered around zero:
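One way to do this, continuing with the X array from above, is to subtract each image's own mean grayscale level:

    # center every row of X (every image) around zero
    X = X - X.mean(axis=1, keepdims=True)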
We continue to follow our best practice to split the data into training and test sets:
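For example (the split ratio and seed are assumptions):

    from sklearn.model_selection import train_test_split

    # hold out 20% of the images for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)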
Because we have a large number of categories (that is, 40), we want to make sure the
random forest is set up to handle them accordingly. We can also play with other optional
arguments, such as the number of data points required in a node before it can be split.
Finally, we might not want to limit the depth of each tree. This is, again, a parameter we will
have to experiment with eventually, but for now, let's set it to a large integer value, making the
depth effectively unconstrained:
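The original code is not shown in the notes; a sketch with scikit-learn's RandomForestClassifier (which handles multiple classes automatically; the parameter values are illustrative) could be:

    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(
        n_estimators=50,        # number of trees (tuned again below)
        min_samples_split=2,    # data points required in a node before it can be split
        max_depth=1000,         # effectively unconstrained depth
        random_state=21,
    )
    forest.fit(X_train, y_train)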
We can check the resulting depth of the tree using the following function:
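With scikit-learn, the depth actually reached by a fitted tree can be read from the individual estimators, for example:

    # depth of the first tree in the forest after training
    print(forest.estimators_[0].get_depth())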
This means that although we allowed the tree to go up to depth 1000, in the end only 25
layers were needed.
The evaluation of the classifier is done once again by predicting the labels first (y_hat) and then
passing them to the accuracy_score function:
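For example:

    from sklearn.metrics import accuracy_score

    y_hat = forest.predict(X_test)
    print(accuracy_score(y_test, y_hat))   # roughly 0.87 according to the notes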
We find 87% accuracy, which turns out to be much better than with a single decision tree:
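For comparison, a single decision tree trained the same way (a hedged sketch) would be:

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(random_state=21).fit(X_train, y_train)
    print(accuracy_score(y_test, tree.predict(X_test)))   # noticeably lower than the forest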
We can play with the optional parameters to see if we get better results. The most important one
seems to be the number of trees in the forest. We can repeat the experiment with a forest
made from 100 trees:
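For instance, continuing the sketch above:

    forest = RandomForestClassifier(n_estimators=100, max_depth=1000, random_state=21)
    forest.fit(X_train, y_train)
    print(accuracy_score(y_test, forest.predict(X_test)))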
Speech Recognition
Speech recognition is a machine's ability to listen to spoken words and identify them. You
can then use speech recognition in Python to convert the spoken words into text, make a
query or give a reply. You can even program some devices to respond to these spoken
words. You can do speech recognition in Python with the help of computer programs that
take input from the microphone, process it, and convert it into a suitable form.
Speech recognition seems highly futuristic, but it is present all around you.
Automated phone calls allow you to speak out your query or the query you wish to be
assisted on; your virtual assistants like Siri or Alexa also use speech recognition to talk to
you seamlessly.
Speech recognition starts by taking the sound energy produced by the person
speaking and converting it into electrical energy with the help of a microphone. It then
converts this electrical energy from analog to digital, and finally to text.
It breaks the audio data down into sounds, and it analyzes the sounds using
algorithms to find the most probable word that fits that audio. All of this is done using Natural
Language Processing and Neural Networks. Hidden Markov models can be used to find
temporal patterns in speech and improve accuracy.
To perform speech recognition in Python, you need to install a speech recognition package to
use with Python. There are multiple packages available online, each with its own specialty.
Picking and installing a speech recognition package
For this implementation, you will use the SpeechRecognition package. It allows:
Easy speech recognition from the microphone.
Easy transcription of an audio file.
Saving audio data to an audio file.
Displaying recognition results in an easy-to-understand format.
Now, let's create a function that takes in audio as input and converts it to text.
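A minimal sketch using the SpeechRecognition package (install it with "pip install SpeechRecognition pyaudio"; the function name is an illustrative choice):

    import speech_recognition as sr

    def speech_to_text():
        recognizer = sr.Recognizer()
        with sr.Microphone() as source:
            recognizer.adjust_for_ambient_noise(source)   # calibrate for background noise
            print("Speak now...")
            audio = recognizer.listen(source)              # record audio from the microphone
        try:
            return recognizer.recognize_google(audio)      # send audio to Google's free web API
        except sr.UnknownValueError:
            return "Sorry, the audio could not be understood."

    print(speech_to_text())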
For the email spam filtering project that follows, we will use a Support Vector Machine (SVM) classifier.
Support vectors are the data points that lie closest to the hyperplane, and they help to maximize the
margin of the classifier. With the help of the support vectors from both classes, we can form a
negative hyperplane and a positive hyperplane. So, basically, we want to maximize the distance
between the decision boundary (the maximum-margin hyperplane) and the support vectors on
both sides, which will minimize the error.
The Kernel trick
SVM works very well on linearly separable data, that is, data points which
can be classified using a straight line. But do we need to manually decide
which higher-dimensional space is appropriate for our dataset? The answer is no.
SVM has a technique called the kernel trick, which takes a low-dimensional input space and
converts it into a higher-dimensional space; in other words, it converts non-separable problems into
separable problems.
What is Natural Language Processing (NLP)?
Human languages come as unstructured forms of data, i.e. text and voice, which cannot be
directly understood by computers. Natural Language Processing (NLP) is an Artificial Intelligence (AI)
field that enables computer programs to recognize, interpret, and manipulate human languages.
Prerequisites:
This project requires you to have a good knowledge of Python and Natural Language
Processing (NLP). Modules required for this project are pandas, pickle, sklearn, numpy, and nltk.
You can install them using the following command:
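For example (pickle ships with Python, so it does not need to be installed separately):

    pip install pandas scikit-learn numpy nltk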
The versions which are used in this application for python and its corresponding modules are
as follows:
1) python : 3.8.5
2) sklearn : 0.24.2
3) pickle : 4.0
4) numpy : 1.19.5
5) pandas : 1.1.5
6) nltk : 3.2.5
Application Structure:
spam.csv: Dataset for our project. It contains Labels as “ham” or “spam” and Email Text.
spamdetector.py: This file is used to load the dataset and train our classifier.
training_data.pkl: This file contains a trained classifier in binary format which will be used
to predict the output.
SpamGui.py: GUI file for our project where we load the trained classifier and predict the
output for a given message.
In this step, we will use Python's pandas module to read the dataset file which we are using for
training and testing purposes. Then we will use the "message_X" variable to store the features
(the EmailText column) and the "labels_Y" variable to store the target (the Label column) from our dataset.
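A sketch of this step (the file and column names follow the notes above):

    import pandas as pd

    data = pd.read_csv('spam.csv')
    message_X = data['EmailText']   # features: the raw email text
    labels_Y = data['Label']        # target: "ham" or "spam"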
After we get the features and targets from our dataset, we will clean the data. Firstly, we will
filter out all the non-alphabetic characters like digits or symbols, and then, using the natural
language processing module 'nltk', we will tokenize our messages.
Also, we will stem all the words to their root words.
Stemming: Stemming is the process of reducing words to their root words.
For example, the message might contain a misspelled word like “frei” instead of “free”. The
stemmer will reduce that word to its root form, “fre”; as a result, “fre” is the root word for both
“free” and “frei”.
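A possible implementation of the cleaning, tokenizing, and stemming step (the helper name clean_message is an illustrative assumption):

    import re
    import nltk
    from nltk.stem import PorterStemmer

    nltk.download('punkt')               # tokenizer models, needed once
    stemmer = PorterStemmer()

    def clean_message(text):
        text = re.sub('[^a-zA-Z]', ' ', text)              # drop digits and symbols
        words = nltk.word_tokenize(text.lower())           # split the message into tokens
        return ' '.join(stemmer.stem(word) for word in words)

    cleaned_messages = message_X.apply(clean_message)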
3) Bag of Words (vectorization)
Bag of words, or vectorization, is the process of converting the words of a sentence into a
numeric vector. It is useful because models require data to be in numeric format. In the simplest
form, we put a 1 if the word is present in a particular sentence and a 0 otherwise; “TfidfVectorizer”
goes a step further and weights each word by how informative it is.
Also, we will remove words that do not add much meaning to our sentence, which in technical
terms are called “stopwords”. For example, these words might be articles, pronouns, or other
common words. For this, we will add a parameter called “stop_words” in “TfidfVectorizer”.
For the labels, we will replace “ham” with 0 and “spam” with 1, as we can have only two outputs.
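A sketch of the vectorization and label encoding described above:

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(stop_words='english')      # drop English stopwords
    features = vectorizer.fit_transform(cleaned_messages)   # sparse matrix of TF-IDF weights
    targets = labels_Y.map({'ham': 0, 'spam': 1})            # ham -> 0, spam -> 1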
Kernel: The kernel is decided on the basis of data transformation. By default, the kernel is
the Radial Basis Function kernel (RBF). We can change it to linear or Polynomial depending on
our dataset.
C parameter: The C parameter is a regularization parameter that tells the classifier how
much misclassification to avoid. If the value of C is high, the classifier will fit the training data
very closely, which might cause overfitting. A low C value allows more
misclassification (errors), which can lead to lower accuracy for our classifier.
Gamma: Gamma is a parameter of the nonlinear (RBF) kernel. A high value means that only data points
that are very close to each other are grouped together. A low value means that data points can
be grouped together even if they are separated by large distances.
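Putting these parameters together, a hedged sketch of the classifier (the exact values are assumptions to experiment with) could be:

    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=0)
    classifier = SVC(kernel='rbf', C=1.0, gamma='scale')   # kernel, C, and gamma as discussed above
    classifier.fit(X_train, y_train)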
Now, we will save our classifier and the other variables as binary (byte-stream) files using
Python's 'pickle' module, which we will then load in the GUI file for prediction.
And finally, after preprocessing and vectorizing the user input, we will predict whether the
message is “ham” or “spam”.
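A sketch of both steps (the exact contents of training_data.pkl are an assumption; the notes only say it stores the trained classifier):

    import pickle

    # in spamdetector.py: persist the fitted vectorizer and classifier
    with open('training_data.pkl', 'wb') as f:
        pickle.dump((vectorizer, classifier), f)

    # in SpamGui.py: load them back and classify a user-supplied message
    with open('training_data.pkl', 'rb') as f:
        vectorizer, classifier = pickle.load(f)

    message = clean_message("Congratulations, you have won a free prize!")
    prediction = classifier.predict(vectorizer.transform([message]))[0]
    print("spam" if prediction == 1 else "ham")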
Machine Learning Spam Filtering Output
The output appears in the Spam Detector GUI window, which provides a text box for the message and a Check button that displays the predicted label.
5.4 CREDIT CARD FRAUD DETECTION
Nowadays most people prefer to pay by card and don't like to carry cash with them.
That leads to an increase in the use of cards, and thereby in fraud.
Credit card fraud is easy to commit: e-commerce and many other online sites have increased
the number of online payment modes, which in turn increases the risk of online fraud.
Credit card fraud is becoming one of the most common types of fraud.
Dataset
In the credit card fraud detection project, we will use a dataset that comes as a CSV file. The dataset
consists of transactions that occurred over two days, with 492 frauds out of 284,807
transactions. The dataset is highly imbalanced: most of the transactions are genuine,
not fraudulent.
After loading the dataset with pandas, we inspect it with credit_card_data.tail() and credit_card_data.info() to see the last few rows, the column names, and the non-null counts.
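A sketch of this loading and inspection step (the file name creditcard.csv is an assumption):

    import pandas as pd

    credit_card_data = pd.read_csv('creditcard.csv')
    print(credit_card_data.tail())   # last five rows of the dataset
    credit_card_data.info()          # column types and non-null counts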
Here we are simply performing exploratory data analysis; follow along to understand the dataset
better and to prepare it so that our model can detect fraudulent and normal transactions
accurately and efficiently.
4) Splitting the data:
After analyzing and visualizing our data, we will now split our dataset into X and Y, that is, into
features and labels:
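The notes do not show which classifier is trained; the sketch below, continuing from the loading step above, splits the data and, as an assumption, uses scikit-learn's LogisticRegression (in the standard credit card dataset the 'Class' column is the fraud label):

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # the notes mention training on very little data, which suggests under-sampling
    # the genuine transactions to match the 492 frauds (an assumption)
    legit = credit_card_data[credit_card_data['Class'] == 0].sample(n=492, random_state=2)
    fraud = credit_card_data[credit_card_data['Class'] == 1]
    balanced = pd.concat([legit, fraud])

    X = balanced.drop(columns='Class')   # features
    Y = balanced['Class']                # labels: 0 = genuine, 1 = fraud

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, Y_train)
    print(accuracy_score(Y_train, model.predict(X_train)))   # training accuracy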
As you can see, the model we have created gives 95% accuracy on the training data, which is
good considering that we are training the model on relatively little data.
Now evaluating our model on test data:
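Continuing the sketch above:

    print(accuracy_score(Y_test, model.predict(X_test)))   # accuracy on unseen test data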
5.5 Medical Diagnosis using Machine Learning:
Project Prerequisites:
Install the following libraries using pip:
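For example:

    pip install numpy pandas matplotlib scikit-learn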
The versions which are used in this Parkinson's disease detection project for Python and its
corresponding modules are as follows:
1) python: 3.8.5
2) numpy: 1.19.5
3) pandas: 1.1.5
4) matplotlib: 3.2.2
5) sklearn: 0.24.2
2) Preprocessing
Read the 'parkinsons.csv' file into a dataframe using the 'pandas' library.
Then fetch the features and target from the dataframe. The features will be all columns except 'name'
and 'status', so we will drop those two columns. Our target will be the 'status' column,
which contains 0s (no Parkinson's disease) and 1s (has Parkinson's disease).
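A sketch of this preprocessing step:

    import pandas as pd

    df = pd.read_csv('parkinsons.csv')
    features = df.drop(columns=['name', 'status'])   # every column except name and status
    target = df['status']                            # 0 = healthy, 1 = Parkinson's disease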
3) Normalization
We will scale our feature data to the range of -1 (minimum value) to 1 (maximum value).
Scaling is important because variables at different scales do not contribute equally to fitting the
model, which may end up creating bias. For that, we will use 'MinMaxScaler()' to fit and
then transform the feature data.
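For example:

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler(feature_range=(-1, 1))     # map every feature into [-1, 1]
    scaled_features = scaler.fit_transform(features)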
PART - A
S.No  Question and Answer  CO,K
10 Briefly Explain Logistic Regression. CO5,K1
Logistic regression is a classification algorithm used to predict a binary
outcome for a given set of independent variables. The output of logistic
regression is a probability that is converted to 0 or 1 using a threshold,
generally 0.5: any value above 0.5 is considered as 1, and any value below
0.5 is considered as 0.
11 What are false positives and false negatives? CO5,K1
False positives are those cases in which negatives are wrongly
predicted as positives. For example, predicting that a credit card
transaction is fraudulent when, in fact, it is not fraud.
False negatives are those cases in which positives are wrongly
predicted as negatives. For example, predicting that a credit card
transaction is not fraudulent when, in fact, it is fraud.
12 What is accuracy? CO5,K1
It is the number of correct predictions out of all predictions made.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
13 How does logistic regression handle categorical variables? CO5,K1
The inputs to a logistic regression model need to be numeric; the
algorithm cannot handle categorical variables directly, so they need to
be converted into a format that is suitable for the algorithm to process.
Each level of a categorical variable is encoded as a separate numeric
indicator known as a dummy variable. These dummy variables are then
handled by the logistic regression model like any other numeric value.
14 Define Bagging. CO5,K1
Bagging means creating different training subsets from the sample training data with
replacement; the final output is then based on majority voting.
REAL TIME APPLICATIONS IN DAY TO DAY LIFE
AND TO INDUSTRY
Smart Replies: You must have observed how Gmail suggests simple phrases to respond
to emails, such as “Thank you”, “Alright”, or “Yes, I’m interested”. These responses are
customized per email as ML and AI understand, estimate, and reflect how one responds
over time.
Content Beyond the Syllabus
Primarily, CropIn uses technologies like AI to help customers analyze and interpret data to
derive real-time insights about standing crops and projects spanning geographies. Its
agri-business intelligence solution, named SmartRisk, leverages agri-alternate data and
offers risk mitigation and forecasting for effective credit risk assessment and loan-recovery
assistance.