Image classification
Learning Objectives
At the end of this session you will be able to:
● Install missing packages
● Perform image processing
● Perform feature extraction and selection
● Conduct model evaluation
Introduction
Object classification is the ability of a machine learning technique to assign an object to its
corresponding label with the help of descriptors learned from hundreds of images.
This is a supervised learning problem: the user must provide training data (a set of objects
along with their labels) to the machine learning technique so that it learns how to categorize
each object (by learning the features behind it) with respect to its class.
In this lab, you will be introduced to one such object classification problem, namely malaria
detection and classification, which is a hard problem because infected and non-infected cells
look similar. As you know, machine learning is all about learning from past data, so a huge
dataset of malaria images is needed to perform real-time malaria detection and classification.
Without worrying too much about real-time malaria classification, in this lab you will see how
to perform a simple image classification task using OpenCV and machine learning algorithms with
the help of Python.
Feature Extraction
Features are the information, or lists of numbers, extracted from an image. They are numeric
values (integer, float or binary). There is a wide range of feature extraction algorithms
in computer vision.
When deciding which features could distinguish infected from non-infected cells, we could
think of color, texture and shape as the primary ones. These are an obvious choice for globally
quantifying and representing an infected or non-infected cell.
But this approach is unlikely to produce good results if we choose only one feature vector, because
the two classes have many attributes in common; for example, non-infected cells can be similar to
infected cells in terms of color. So, we need to quantify the image by combining different feature
descriptors so that the combined vector describes the image more effectively.
Global Feature Descriptors
These are the feature descriptors that quantify an image globally. They do not have the concept
of interest points and thus take in the entire image for processing. Some of the commonly used
global feature descriptors are:
Color - Color Channel Statistics (Mean, Standard Deviation) and Color Histogram
Shape - Hu Moments, Zernike Moments
Texture - Haralick Texture, Local Binary Patterns (LBP)
Others - Histogram of Oriented Gradients (HOG), Threshold Adjacency Statistics (TAS)
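As an illustration of the simplest descriptor in the list above, the following is a minimal sketch of color channel statistics computed with NumPy alone. The function name color_channel_stats and the synthetic test image are assumptions made for this example; they are not part of the lab code.

```python
import numpy as np


def color_channel_stats(image):
    """Return per-channel mean and standard deviation as one flat vector."""
    # Flatten the height and width axes so each row is one pixel.
    pixels = image.reshape(-1, image.shape[2]).astype(np.float64)
    return np.hstack([pixels.mean(axis=0), pixels.std(axis=0)])


# Synthetic 4x4 image with 3 channels (hypothetical data, not a malaria cell image).
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[..., 0] = 10        # channel 0 is constant
image[..., 1] = 20        # channel 1 is constant
image[:2, :, 2] = 100     # channel 2 is half 0, half 100

stats = color_channel_stats(image)
print(stats)  # three means followed by three standard deviations
```

The result is a 6-element vector for a 3-channel image, which can later be concatenated with other descriptors.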
Local Feature Descriptors
These are the feature descriptors that quantify local regions of an image. Interest points are
determined in the entire image, and the image patches/regions surrounding those interest points are
considered for analysis. Some of the commonly used local feature descriptors are SIFT, SURF, ORB
and BRIEF.
This lab uses the following libraries:
● scikit-image and OpenCV, to read cell image files and process them as required
● matplotlib and seaborn, to view cell images and plot some graphs
● basic Python os and math functionality
● numpy and random, to manipulate arrays and generate random numbers
● scikit-learn (sklearn), to carry out feature engineering, model fitting and hyperparameter searching
Installation of libraries
To install any library you can use either conda or pip. For instance, to install OpenCV, run
pip install opencv-python in the command terminal.
To install the mahotas package, run:
conda install -c conda-forge mahotas
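Before starting, it can help to check which of the libraries are still missing. The sketch below uses only the Python standard library. The REQUIRED list and the helper name find_missing are assumptions for this example; note that import names can differ from install names (opencv-python is imported as cv2, scikit-learn as sklearn, scikit-image as skimage).

```python
import importlib.util

# Import names of the libraries used in this lab (assumed list, adjust as needed).
REQUIRED = ["cv2", "mahotas", "skimage", "sklearn", "matplotlib", "seaborn", "numpy"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

Any name printed as missing can then be installed with conda or pip as described above.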
Functions for global feature descriptors
Hu Moments
To extract Hu Moments features from the image, the cv2.HuMoments() function provided by
OpenCV will be used. The argument to this function is the set of image moments returned by
cv2.moments(). The moments of the image are computed, passed to cv2.HuMoments(), and the result
is converted to a vector using flatten(). Before doing that, the color image should be converted
into a grayscale image, because cv2.moments() expects a single-channel image.
Haralick Textures
To extract Haralick Texture features from the image, the mahotas.features.haralick() function
from the mahotas library will be used. Before doing that, the color image should be converted
into a grayscale image, because the Haralick feature descriptor expects grayscale images.
Color Histogram
To extract Color Histogram features from the image, the cv2.calcHist() function provided by
OpenCV will be used. The arguments it expects are the image, channels, mask, histSize (bins)
and ranges for each channel (typically 0 to 256). The histogram is then normalized using the
normalize() function of OpenCV, and a flattened version of the normalized matrix is returned
using flatten().
For each training label name, loop through the corresponding folder to get all the images
inside it. For each image, first resize it to a fixed size. Then extract the three global
features and concatenate them using NumPy's np.hstack() function. Keep track of each feature
vector and its label using two lists, labels and global_features. You could even use a
dictionary here.
After extracting and concatenating the features, the data should be saved locally. Before saving
the data, LabelEncoder() is used to encode the labels in a proper format. This makes sure that
the labels are represented as unique numbers.
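A minimal sketch of the encoding and saving step is shown below. The label strings, the random feature matrix, the file name features.npz and the use of NumPy's .npz format are all assumptions for illustration; the lab may use a different storage format.

```python
import os
import tempfile

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels and features standing in for the extracted data.
labels = ["infected", "uninfected", "infected", "uninfected"]
global_features = np.random.default_rng(0).random((4, 5))

# Map each class name to a unique integer (classes are sorted alphabetically).
encoder = LabelEncoder()
target = encoder.fit_transform(labels)

# Save features, encoded labels and class names together in one file.
out_path = os.path.join(tempfile.mkdtemp(), "features.npz")
np.savez(out_path, features=global_features, labels=target, classes=encoder.classes_)

# Reload to confirm the data round-trips.
data = np.load(out_path)
print(data["features"].shape, data["labels"], data["classes"])
```

Storing the class names next to the encoded labels makes it possible to translate predictions back to readable names later.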
Training classifiers
After extracting, concatenating and saving global features and labels from the training dataset, it is
time to train the system. To do that, the machine learning models need to be created. The
scikit-learn library will be used to create them.
Logistic Regression, Linear Discriminant Analysis, K-Nearest Neighbors, Decision Trees,
Random Forests, Gaussian Naive Bayes and Support Vector Machine will be used as the machine
learning models. To understand these algorithms, please refer to online resources.
Furthermore, the train_test_split function provided by scikit-learn will be used to split the training
dataset into train_data and test_data. This way, the models are trained with train_data and
then tested on the unseen test_data. The split size is decided by the test_size parameter.
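The split can be sketched as follows. The random feature matrix, the alternating labels, the test_size of 0.2 and the random_state value are assumptions chosen for the example; stratify is an optional extra that keeps the class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with 10 features each, two balanced classes.
rng = np.random.default_rng(0)
features = rng.random((100, 10))
labels = np.array([0, 1] * 50)

train_data, test_data, train_labels, test_labels = train_test_split(
    features,
    labels,
    test_size=0.2,      # 20% of the samples are held out for testing
    random_state=9,     # fixed seed so the split is reproducible
    stratify=labels,    # keep class proportions the same in both parts
)

print(train_data.shape, test_data.shape)
```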
Import all the necessary libraries and create a models list. This list holds
all the machine learning models that will be trained with the locally stored features.
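A sketch of the models list and a simple training loop is given below. Synthetic data from make_classification stands in for the stored malaria features, and the short model names, hyperparameters and random_state values are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the saved features (hypothetical).
X, y = make_classification(n_samples=200, n_features=20, random_state=9)
train_data, test_data, train_labels, test_labels = train_test_split(
    X, y, test_size=0.2, random_state=9
)

# One (name, estimator) pair per classifier named in the text.
models = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("LDA", LinearDiscriminantAnalysis()),
    ("KNN", KNeighborsClassifier()),
    ("CART", DecisionTreeClassifier(random_state=9)),
    ("RF", RandomForestClassifier(n_estimators=100, random_state=9)),
    ("NB", GaussianNB()),
    ("SVM", SVC(random_state=9)),
]

# Train every model on the same split and record its test accuracy.
scores = {}
for name, model in models:
    model.fit(train_data, train_labels)
    scores[name] = model.score(test_data, test_labels)
    print(f"{name}: {scores[name]:.3f}")
```

Keeping the models in one list makes it easy to compare them under identical conditions and to add or remove a classifier later.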
Testing classifier
Test the model you built by predicting labels for the unseen test_data and comparing them with
the true labels.
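A minimal sketch of such a test is shown below. It assumes a Random Forest as the chosen model and uses synthetic data in place of the malaria features; the metrics shown (accuracy and confusion matrix) are common choices, not necessarily the ones the lab intends.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the saved features (hypothetical).
X, y = make_classification(n_samples=200, n_features=20, random_state=9)
train_data, test_data, train_labels, test_labels = train_test_split(
    X, y, test_size=0.2, random_state=9
)

# Train on the training part only.
clf = RandomForestClassifier(n_estimators=100, random_state=9)
clf.fit(train_data, train_labels)

# Predict on data the model has never seen and compare with the true labels.
predictions = clf.predict(test_data)
accuracy = accuracy_score(test_labels, predictions)
cm = confusion_matrix(test_labels, predictions, labels=[0, 1])

print("Accuracy:", accuracy)
print("Confusion matrix (rows = true class, columns = predicted class):")
print(cm)
```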
Assignment
This question uses the malaria dataset once again. Create a test set consisting of 1/2 of
the data, and use the rest for training.
● Prepare the features in CSV format and indicate the name of each feature.
● Fit a random forest, decision tree, KNN, logistic regression, linear discriminant
analysis, quadratic discriminant analysis and boosting model to the training data.
● Predict the labels for the corresponding test data.
● Compute the confusion matrix for the test data.
● Compute the AUC (Area Under the Curve) for each classifier.
● Plot the ROC curves as evaluated on the test data.
● Out of all the classifiers used in this assignment, which would you choose as the final model for
the malaria data?