@vtudeveloper - in ISMLA Mod 5

The document outlines the syllabus and key components of a course on Intelligent Systems and Machine Learning Algorithms, focusing on an end-to-end machine learning project using real data. It covers essential steps such as problem framing, data acquisition, model training, and evaluation techniques including performance measures and error analysis. The course also delves into classification tasks, including binary and multiclass classification, and emphasizes the importance of using real-world datasets for practical learning.

Uploaded by

H M BRUNDA

S.D.M.

Jain matt Trust®


A.G.M RURAL COLLEGE OF ENGINEERING AND TECHNOLOGY, VARUR, HUBLI
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

NOTES

Subject with code: Intelligent Systems and Machine Learning Algorithms (BEC515A)

Module 5

Prepared By
Prof. ASIF IQBAL MULLA
Department of Electronics and Communication

Syllabus:

End-to-End Machine Learning Project: Working with real data, Look at the big picture, Get
the data, Discover and visualize the data, Prepare the data, Select and train the model,
Fine-tune your model. Classification: MNIST, Training a binary classifier, Performance
measures, Multiclass classification, Error analysis, Multilabel classification, Multioutput
classification.

Textbook: Chapter 2, Chapter 3

Aurélien Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow, O’Reilly,
Shroff Publishers and Distributors Pvt. Ltd., 2019.

Prof. Asif Iqbal M, Dept of ECE 1


Intelligent Systems and Machine Learning Algorithms (BEC515A)

I. End-to-End Machine Learning Project

1. Working with Real Data

When you are learning about Machine Learning, it is best to experiment with real-world
data, not just artificial datasets. Fortunately, there are thousands of open datasets to
choose from, ranging across all sorts of domains. Here are a few places you can look to get
data:

• Popular open data repositories:
—UC Irvine Machine Learning Repository
—Kaggle datasets
—Amazon’s AWS datasets
• Meta portals (they list open data repositories):
—http://dataportals.org/
—http://opendatamonitor.eu/
—http://quandl.com/
• Other pages listing many popular open data repositories:
—Wikipedia’s list of Machine Learning datasets
—Quora.com question
—Datasets subreddit

For example, the project in this module uses the California Housing Prices dataset from the
StatLib repository (see Figure 2-1).

2. Look at the Big Picture

Frame the Problem

The first question to ask your boss is what exactly is the business objective; building a model
is probably not the end goal. How does the company expect to use and benefit from this
model?

This is important because it will determine how you frame the problem, what algorithms you
will select, what performance measure you will use to evaluate your model, and how much
effort you should spend tweaking it.

The next question to ask is what the current solution looks like (if any).
It will often give you a reference performance, as well as insights on how to solve the
problem.

Your boss answers that the district housing prices are currently estimated manually by
experts: a team gathers up-to-date information about a district, and when they cannot get the
median housing price, they estimate it using complex rules. This is costly and time-consuming,
and their estimates are not great.
This is why the company thinks that it would be useful to train a model to predict a district’s
median housing price given other data about that district.
Okay, with all this information you are now ready to start designing your system. First, you
need to frame the problem: is it supervised, unsupervised, or Reinforcement Learning? Is it a
classification task, a regression task, or something else? Should you use batch learning or
online learning techniques? Before you read on, pause and try to answer these questions for
yourself.

Select a Performance Measure

Your next step is to select a performance measure. A typical performance measure for
regression problems is the Root Mean Square Error (RMSE). Equation 2-1 shows the
mathematical formula to compute the RMSE.

Suppose that there are many outlier districts. In that case, you may consider using the Mean
Absolute Error (MAE, also called the Average Absolute Deviation; see Equation 2-2), which is
less sensitive to outliers than the RMSE.
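
The two measures can be compared on a small example. The sketch below is plain Python, not
the textbook's code: the RMSE is the square root of the mean squared error and the MAE is the
mean of the absolute errors. The house values are hypothetical, and the last district is
deliberately an outlier.

```python
import math

def rmse(predictions, labels):
    """Root Mean Square Error: square root of the mean squared difference."""
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae(predictions, labels):
    """Mean Absolute Error: mean of the absolute differences."""
    absolute_errors = [abs(p - y) for p, y in zip(predictions, labels)]
    return sum(absolute_errors) / len(absolute_errors)

# Hypothetical median house values in dollars; the last prediction is far off.
labels = [200_000, 150_000, 300_000, 250_000]
predictions = [210_000, 140_000, 310_000, 150_000]

rmse_value = rmse(predictions, labels)
mae_value = mae(predictions, labels)
print(f"RMSE = {rmse_value:.0f}, MAE = {mae_value:.0f}")
```

Because the errors are squared before averaging, the single large error dominates the RMSE
much more than it dominates the MAE.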

Check the Assumptions

Lastly, it is good practice to list and verify the assumptions that were made so far (by you or
others); this can catch serious issues early on.

3. Get the Data

Create the Workspace

You will need a number of Python modules: Jupyter, NumPy, Pandas, Matplotlib, and
Scikit-Learn. If you already have Jupyter running with all these modules installed, you can
safely skip to “Download the Data”. If you don’t have them yet, there are many ways to install
them (and their dependencies).

You can install a scientific Python distribution such as Anaconda and use its packaging system,
or just use Python’s own packaging system, pip, which is included by default with the Python
binary installers (since Python 2.7.9). With pip, all the required modules and their
dependencies can be installed with a single command.

Download the Data

In typical environments your data would be available in a relational database (or some other
common datastore) and spread across multiple tables/documents/files. In this housing project,
however, things are much simpler: you will just download a single compressed file,
housing.tgz, which contains a comma-separated values (CSV) file called housing.csv with all
the data. You could use your web browser to download it and run tar xzf housing.tgz to
decompress the file and extract the CSV file, but it is preferable to create a small function to
do that.
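
A minimal sketch of such a function is shown below, using only the standard library. The
function name, the local datasets/housing directory and the download URL are assumptions
(the URL is believed to point to the textbook's public repository and may have moved), so
adjust them to your setup.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumed archive location and local directory; change them if needed.
HOUSING_URL = ("https://raw.githubusercontent.com/ageron/handson-ml/"
               "master/datasets/housing/housing.tgz")
HOUSING_PATH = Path("datasets") / "housing"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz into housing_path and extract housing.csv there."""
    housing_path = Path(housing_path)
    housing_path.mkdir(parents=True, exist_ok=True)
    tgz_path = housing_path / "housing.tgz"
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
    return housing_path / "housing.csv"
```

Calling fetch_housing_data() once creates the directory, downloads the archive and returns
the path of the extracted CSV file. Wrapping this in a function is useful when the data
changes regularly or must be fetched on several machines.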

Now let’s load the data using Pandas. Once again you should write a small function to load
the data:
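
A minimal sketch of such a loading function follows. The function name and the default
datasets/housing directory are assumptions; pandas.read_csv() does the actual work and
returns a DataFrame.

```python
from pathlib import Path

import pandas as pd

def load_housing_data(housing_path=Path("datasets") / "housing"):
    """Read housing.csv from housing_path into a Pandas DataFrame."""
    csv_path = Path(housing_path) / "housing.csv"
    return pd.read_csv(csv_path)
```

After the archive has been extracted, housing = load_housing_data() gives a DataFrame with
one row per district.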

Take a Quick Look at the Data Structure

The info() method is useful to get a quick description of the data, in particular the total
number of rows, and each attribute’s type and number of non-null values (see Figure 2-6).

Let’s look at the other fields. The describe() method shows a summary of the numerical
attributes (Figure 2-7).
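
Both methods can be tried on a tiny stand-in DataFrame. The four rows below are made up for
illustration (they are not the real housing data); one value is deliberately missing so that
the non-null count differs between columns.

```python
import pandas as pd

# Hypothetical miniature of the housing data, with one missing value.
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8],
    "total_bedrooms": [129.0, 1106.0, None, 235.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

housing.info()                 # rows, column types, non-null counts
summary = housing.describe()   # count, mean, std, min, quartiles, max
print(summary)
```

Note that describe() ignores missing values, so the count row reports 3 for total_bedrooms,
and by default it only summarizes the numerical columns.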

Create a Test Set

Creating a test set is theoretically quite simple: just pick some instances randomly, typically
20% of the dataset (or less if your dataset is very large), and set them aside. The training
and test sets can be created with a few lines of Python or with Scikit-Learn’s
train_test_split() function.
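
The idea can be sketched in plain Python as follows. The function name, the seed value and
the list of 100 stand-in districts are assumptions made for illustration; fixing the seed
makes the split reproducible, so the same instances end up in the test set on every run.

```python
import random

def split_train_test(data, test_ratio=0.2, seed=42):
    """Shuffle the indices with a fixed seed and set aside test_ratio of them."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    test_size = int(len(data) * test_ratio)
    test_indices = indices[:test_size]
    train_indices = indices[test_size:]
    train = [data[i] for i in train_indices]
    test = [data[i] for i in test_indices]
    return train, test

districts = list(range(100))          # stand-in for 100 district records
train_set, test_set = split_train_test(districts)
print(len(train_set), "train +", len(test_set), "test")
```

Without a fixed seed, repeated runs would eventually expose the whole dataset to the model,
which is exactly what a test set is meant to avoid.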

4. Discover and Visualize the Data to Gain Insights

Visualizing Geographical Data

Looking for Correlations

The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that there is a
strong positive correlation; for example, the median house value tends to go up when the
median income goes up. When the coefficient is close to –1, it means that there is a strong
negative correlation.
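
To make the range of the coefficient concrete, the sketch below computes the standard
(Pearson) correlation coefficient in plain Python. The function name and the five data
points are hypothetical, and the function assumes that neither list is constant (otherwise
the denominator would be zero).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical districts: house value rises with income.
median_income = [1.5, 2.0, 3.5, 5.0, 8.0]
median_house_value = [80_000, 110_000, 170_000, 240_000, 400_000]

r_income = pearson_r(median_income, median_house_value)
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly decreasing line
print(f"income vs. value: {r_income:.3f}, decreasing line: {r_negative:.3f}")
```

The coefficient only measures linear relationships, so a value near 0 does not rule out a
nonlinear dependence between two attributes.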

Let’s just focus on a few promising attributes that seem most correlated with the median
housing value (Figure 2-15):

from pandas.plotting import scatter_matrix

# housing is the DataFrame loaded from housing.csv
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

5. Prepare the Data for Machine Learning Algorithms

Min-max scaling (many people call this normalization) is quite simple: values are shifted and
rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and
dividing by the max minus the min. Scikit-Learn provides a transformer called
MinMaxScaler for this.
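
The arithmetic described above can be written out directly. The sketch below is plain Python
rather than the MinMaxScaler transformer itself; the function name and the sample ages are
hypothetical, and it assumes the column is not constant (max greater than min).

```python
def min_max_scale(values):
    """Rescale values to the range 0 to 1: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

housing_median_age = [2, 15, 28, 41, 52]   # hypothetical ages in years
scaled = min_max_scale(housing_median_age)
print(scaled)
```

The smallest value maps to 0 and the largest to 1, so a single extreme outlier squeezes all
other values into a narrow part of the range.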

6. Select and Train a Model

7. Fine-Tune Your Model

This param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of
n_estimators and max_features hyperparameter values specified in the first dict, then try all
2 × 3 = 6 combinations of hyperparameter values in the second dict, but this time with the
bootstrap hyperparameter set to False instead of True (which is the default value for this
hyperparameter). All in all, the grid search will explore 12 + 6 = 18 combinations of
RandomForestRegressor hyperparameter values, and it will train each model five times
(since we are using five-fold cross-validation). In other words, there will be 18 × 5 =
90 rounds of training.
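
The counting can be checked with a few lines of Python. The grid below is an assumption: it
has the shape described in the paragraph above (3 × 4 values in the first dict, 2 × 3 values
with bootstrap set to False in the second), but the specific values are illustrative. Only
the standard library is used; no model is trained.

```python
from itertools import product

# Assumed grid matching the description: 3 x 4 values, then 1 x 2 x 3 values.
param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

def count_combinations(grid):
    """Sum, over each dict, the number of value combinations it describes."""
    return sum(len(list(product(*sub_grid.values()))) for sub_grid in grid)

n_combinations = count_combinations(param_grid)
n_folds = 5
n_training_rounds = n_combinations * n_folds
print(n_combinations, "combinations,", n_training_rounds, "training rounds")
```

Each dict is expanded independently, which is why the totals add (12 + 6) rather than
multiply.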

II. Classification

1. MNIST

2. Training a Binary Classifier

3. Performance Measures
There are many performance measures available, so get ready to learn many new concepts
and acronyms!

Measuring Accuracy Using Cross-Validation

Let’s use the cross_val_score() function to evaluate your SGDClassifier model using K-fold
cross-validation, with three folds. Remember that K-fold cross-validation means splitting the
training set into K folds (in this case, three), then making predictions and evaluating them on
each fold using a model trained on the remaining folds.
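
The splitting step can be sketched in plain Python. This is only an illustration of the idea,
not how cross_val_score() is implemented: the function name is hypothetical, the folds are
contiguous blocks, and no shuffling is performed.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for training, validation in folds:
    print("train on", training, "evaluate on", validation)
```

Every instance is used for evaluation exactly once, and the model that evaluates it never
saw it during training.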

Confusion Matrix

A much better way to evaluate the performance of a classifier is to look at the confusion
matrix. The general idea is to count the number of times instances of class A are classified as
class B. For example, to know the number of times the classifier confused images of 5s with
3s, you would look in the 5th row and 3rd column of the confusion matrix.
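To compute the confusion matrix, you first need to have a set of predictions, so they can be
compared to the actual targets. You could make predictions on the test set, but it is best to
keep the test set untouched until the very end of the project.

Instead, you can use the cross_val_predict() function: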


from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

Each row in a confusion matrix represents an actual class, while each column represents a
predicted class. The first row of this matrix considers non-5 images (the negative class):
53,057 of them were correctly classified as non-5s (they are called true negatives), while the
remaining 1,522 were wrongly classified as 5s (false positives).
The second row considers the images of 5s (the positive class): 1,325 were wrongly classified
as non-5s (false negatives), while the remaining 4,096 were correctly classified as 5s (true
positives).
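
The row and column convention can be illustrated by counting the four cases by hand. The
sketch below is plain Python with a hypothetical function name and eight made-up labels; it
is not Scikit-Learn's implementation.

```python
def confusion_matrix_2x2(actual, predicted):
    """Rows are actual classes, columns are predicted classes (False, True)."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[int(a)][int(p)] += 1
    return matrix

# Hypothetical labels: True means "the image is a 5".
actual    = [False, False, False, False, True, True,  True, False]
predicted = [False, True,  False, False, True, False, True, False]

matrix = confusion_matrix_2x2(actual, predicted)
(tn, fp), (fn, tp) = matrix
print(matrix)   # [[TN, FP], [FN, TP]]
```

A perfect classifier would have nonzero values only on the main diagonal (true negatives and
true positives).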

Precision and Recall

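Precision is the accuracy of the positive predictions: the ratio of true positives to all
instances predicted as positive, TP / (TP + FP). Precision is typically used along with
another metric named recall, also called sensitivity or true positive rate (TPR): this is the
ratio of positive instances that are correctly detected by the classifier, TP / (TP + FN)
(Equation 3-2).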

It is often convenient to combine precision and recall into a single metric called the F1 score,
in particular if you need a simple way to compare two classifiers. The F1 score is the
harmonic mean of precision and recall (Equation 3-3).
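
As a worked example, the three metrics can be computed from the confusion matrix counts
quoted earlier in this section (53,057 true negatives, 1,522 false positives, 1,325 false
negatives and 4,096 true positives). The variable names are only illustrative.

```python
# Counts taken from the confusion matrix discussed above.
tn, fp, fn, tp = 53_057, 1_522, 1_325, 4_096

precision = tp / (tp + fp)   # share of predicted 5s that really are 5s
recall = tp / (tp + fn)      # share of actual 5s that were detected
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1 = {f1:.2f}")
```

The harmonic mean gives more weight to low values, so a classifier only gets a high F1 score
if both precision and recall are high.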

Precision/Recall Tradeoff

To understand this tradeoff, let’s look at how the SGDClassifier makes its classification
decisions. For each instance, it computes a score based on a decision function, and if that
score is greater than a threshold, it assigns the instance to the positive class, or else it assigns
it to the negative class. Figure 3-3 shows a few digits positioned from the lowest score on the
left to the highest score on the right.
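
The effect of moving the threshold can be seen on a handful of scores. The sketch below is
plain Python; the function name, the eight decision scores and their labels are all
hypothetical, and the function assumes at least one instance is predicted positive.

```python
def precision_recall_at(scores, is_positive, threshold):
    """Return (precision, recall) when predicting positive for score > threshold."""
    predicted = [score > threshold for score in scores]
    tp = sum(p and y for p, y in zip(predicted, is_positive))
    fp = sum(p and not y for p, y in zip(predicted, is_positive))
    fn = sum(not p and y for p, y in zip(predicted, is_positive))
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical decision scores, sorted from lowest to highest, with true labels.
scores  = [-2.0, -1.0, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
is_five = [False, True, False, True, True, False, True, True]

low_threshold = precision_recall_at(scores, is_five, threshold=0.0)
high_threshold = precision_recall_at(scores, is_five, threshold=3.5)
print("threshold 0.0:", low_threshold)
print("threshold 3.5:", high_threshold)
```

In this example, raising the threshold removes the false positives (precision rises to 1.0)
but also discards several real 5s (recall falls from 0.8 to 0.4).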

The ROC Curve

The receiver operating characteristic (ROC) curve is another common tool used with binary
classifiers. It is very similar to the precision/recall curve, but instead of plotting precision
versus recall, the ROC curve plots the true positive rate (another name for recall) against the
false positive rate.
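
Each point of the curve corresponds to one threshold. The plain Python sketch below computes
a few such points; the function name, scores and labels are hypothetical, and the false
positive rate is taken as FP divided by the number of actual negatives.

```python
def roc_point(scores, is_positive, threshold):
    """Return (false positive rate, true positive rate) for one threshold."""
    predicted = [score > threshold for score in scores]
    tp = sum(p and y for p, y in zip(predicted, is_positive))
    fp = sum(p and not y for p, y in zip(predicted, is_positive))
    n_positive = sum(is_positive)
    n_negative = len(is_positive) - n_positive
    return fp / n_negative, tp / n_positive

# Hypothetical decision scores with their true labels.
scores  = [-2.0, -1.0, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
is_five = [False, True, False, True, True, False, True, True]

thresholds = [-3.0, 0.0, 3.5, 6.0]
roc_points = [roc_point(scores, is_five, t) for t in thresholds]
for t, (fpr, tpr) in zip(thresholds, roc_points):
    print(f"threshold {t:5.1f}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

A very low threshold flags everything as positive (the top-right corner of the curve), and a
very high one flags nothing (the bottom-left corner); a good classifier stays close to the
top-left corner in between.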

4. Multiclass Classification
Whereas binary classifiers distinguish between two classes, multiclass classifiers (also
called multinomial classifiers) can distinguish between more than two classes.

Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) are
capable of handling multiple classes directly. Others (such as Support Vector Machine
classifiers or linear classifiers) are strictly binary classifiers. However, there are various
strategies that you can use to perform multiclass classification using multiple binary
classifiers.
For example, one way to create a system that can classify the digit images into 10
classes (from 0 to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a
1-detector, a 2-detector, and so on). Then when you want to classify an image, you get the
decision score from each classifier for that image and you select the class whose classifier
outputs the highest score. This is called the one-versus-all (OvA) strategy (also called
one-versus-the-rest).
Another strategy is to train a binary classifier for every pair of digits: one to
distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. This
is called the one-versus-one (OvO) strategy. If there are N classes, you need to train
N × (N – 1) / 2 classifiers. Each OvO classifier only needs to be trained on the part of the
training set containing its two classes, so OvO is preferred for algorithms that scale poorly
with the size of the training set (such as Support Vector Machine classifiers). For most
binary classification algorithms, however, OvA is preferred.
Let’s try this with the SGDClassifier:

sgd_clf.fit(X_train, y_train)  # y_train, not y_train_5

That was easy! This code trains the SGDClassifier on the training set using the original target
classes from 0 to 9 (y_train), instead of the 5-versus-all target classes (y_train_5).
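
The number of binary classifiers required by the two strategies described above can be
checked for the 10 digit classes. The sketch only enumerates the classifiers that would have
to be trained; the names are illustrative and nothing is actually fitted.

```python
from itertools import combinations

digits = range(10)

# One-versus-all: one detector per class.
ova_detectors = [f"{digit}-versus-rest" for digit in digits]

# One-versus-one: one classifier per unordered pair of classes.
ovo_pairs = list(combinations(digits, 2))

n_classes = len(digits)
print(len(ova_detectors), "OvA classifiers")
print(len(ovo_pairs), "OvO classifiers =", n_classes * (n_classes - 1) // 2)
```

For MNIST this gives 10 classifiers with OvA against 45 with OvO, although each OvO
classifier is trained on only two of the ten classes.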

5. Error Analysis

Here, we will assume that you have found a promising model and you want to find ways to
improve it. One way to do this is to analyze the types of errors it makes.

Analyzing individual errors can also be a good way to gain insights on what your classifier is
doing and why it is failing, but it is more difficult and time-consuming. For example, let’s
plot examples of 3s and 5s (the plot_digits() function just uses Matplotlib’s imshow()
function; see this chapter’s Jupyter notebook for details):

The two 5×5 blocks on the left show digits classified as 3s, and the two 5×5 blocks on the
right show images classified as 5s. Some of the digits that the classifier gets wrong (i.e., in
the bottom-left and top-right blocks) are so badly written that even a human would have
trouble classifying them (e.g., the 5 on the 1st row and 2nd column truly looks like a badly
written 3).

6. Multilabel Classification

7. Multioutput Classification
