18COC257
B625726
COMMERCIAL APPLICATIONS
OF VEHICLE IMAGE CLASSIFICATION
WITH DATA MINING
by
Ross D. Massie
Supervisor: Dr S. Saravi
April 2019
Abstract
Keywords
Acknowledgements
1 - Introduction
1.1 - Aims
1.2 - Objectives
1.3 - Methodology
2 - Background
2.1 - General Analysis
2.2 - Motivation for undertaking the Project
3 - Literature Review
3.1 - Material Review
3.1.1 - Datasets
3.1.2 - Literature and Academia
3.2 - Gathering of findings
5.3.1 - ARFF formatting issues
5.3.2 - Colour space issues
5.3.3 - Removing attributes and expanding memory allocation
5.4 - Running Orange Classification
5.5 - Comparison of Models
5.5.1 - Results of the first quarter of the data
5.5.2 - Results of the second quarter of the data
5.5.3 - Results of the third quarter of the data
5.5.4 - Results of the fourth quarter of the data
5.6 - Weka Classification
6 - Conclusion
7 - References
Abstract
This research builds upon recent work in Vehicle Make and Model Recognition (VMMR) by
attempting to push these discoveries into the scope of commercial and practical applications. The
primary objective of this paper is to conceptualise, propose and implement a system which an end
user can utilise to perform safer automobile purchases. This is achieved by recognising a car’s make
and model from a picture, then gathering data pertaining to that car. This data is analysed and
presented to the end user as various metrics that provide a more accurate depiction of the car’s
value. The market for this system would be sites such as Facebook’s “Marketplace” and other car
trading platforms.
Keywords
VMMR - Vehicle Make and Model Recognition
AUC - Area under ROC, the area under the receiver-operating curve
Recall - The proportion of true positives amongst all positive instances in the data
Acknowledgements
While writing this dissertation, I have been fortunate to have much support on hand. Firstly, I would
like to thank my supervisor, Dr. S. Saravi, whose expertise was indispensable in the inception of the
research topic and choice of tools in particular.
In addition, I would like to thank my family for their wisdom and invaluable support; I couldn’t have
done this without you. Finally, there are my friends, whose constant presence assisted greatly in
ensuring my ability to complete the dissertation. Thank you all.
1 - Introduction
1.1 - Aims
The ultimate aim of this project is to create a system which can operate on a photo, recognising
what vehicle class is in the image. Once the class has been found, the system will search external
datasets, pulling back information on the vehicle that may be of use to the end user.
This system is intended for users buying cars in person, allowing them to near-immediately receive
a summary of several relevant statistics relating to that car and that type of car in general.
1.2 - Objectives
The objectives for this project are as follows:
1. Analyse and learn to use Weka and Orange as platforms for image classification. Document
how the software works and how it can be used in pursuit of the defined goals.
2. Produce solutions that fulfil the goal of the software, using Orange and Weka. Document
how these were produced.
3. Compare and contrast the two software packages to determine which is the most apt for the
project, including the pros and cons of each method.
4. Evaluate the findings of the project and whether the goals, aims and objectives have been
achieved.
1.3 - Methodology
This research study adopts a quantitative approach, emphasising a methodology that aims to solve a
real-world problem and prove this via numerical results to be evaluated. To perform this, research
was conducted into similar publications on how image recognition of vehicle models is achieved.
This initial emphasis on secondary sources is appropriate because much of the theory of how to
achieve recognition has already been established. Analysing these methods and building on them to
associate mined data with the recognised cars constitutes the primary research. To this end, various
papers and journals have been referenced, and the databases of use to this project noted.
2 - Background
2.1 - General Analysis
The problem addressed by this research originated from the emergence of the Facebook
marketplace and similar reasonably unregulated platforms for trading automobiles. It became clear
through a number of reports that a large proportion of cars sold on these platforms have been
involved in criminal activity or are otherwise scams.[1]
Additionally, there is an increased rate of people buying cars off the street or from acquaintances,
given the potential savings in doing so. These methods are also far less safe than traditional trading
methods, as the innate trust we have in people we know makes it more likely that a person will buy
a car without performing the necessary checks.[2]
To that end, the average person utilising these platforms could be said to have a much higher chance
of being scammed than if they used traditional trading methods such as car showrooms. This
problem exists as an inevitable consequence of the increasing interconnectivity that comes with the
rise of social media.[3]
Finally, one nuance of the problem of in-person purchases outside the realms of the internet is that
there is normally a restricted time frame in which to agree to buy a product. Furthermore, it is
plausible that being seen to openly question and be suspicious of the validity of the product can
cause the vendor to become more aggressive. This requires the solution to be usable covertly and
quickly, so that the investigation into the validity of the product is not perceived by the vendor of
the product.[4]
3 - Literature Review
3.1 - Material Review
3.1.1 - Datasets
This project has utilised several databases to facilitate the association of their contained information
with the model of the scanned car. The first notable dataset, from Jonathan Krause et al.[5], consists
of 16,185 car images across 196 classes and will be used for training and testing the neural network
itself. This dataset will be the basis for initial training and testing of the system, as its content is
substantial enough to facilitate a high degree of accuracy.
There are also several datasets, which will be mined once the car’s model has been ascertained.
These will provide data regarding that model of car to the user, which may be useful in ensuring end
users will not become victims of scams.
One notable dataset is the Vehicle Safety Branch Recalls Database, provided by Data.gov[6]. This
dataset contains all the safety recalls issued by manufacturers as of 12 December 2013. This data will
be vital, as the DVSA has stated that 1 in 13 used cars have safety recall issues. Additionally, owning a
car with such a recall could net a £2,500 fine. Therefore, buying such a car would be very ill-advised,
and this danger must be accounted for in the developed system.
Another dataset is the Car Theft Index, provided by Data.gov[7]. This data was last updated on 9
February 2010, meaning it is likely of less use than the above datasets; however, it will likely still be
of some utility. The dataset itself holds data from 2005, listing the cars which are most likely to be
stolen.
The Anonymised MOT Tests and Results dataset, provided by Data.gov[8], is also to be implemented
into the system. This will be a very interesting dataset to use, as the potential for mining is large,
with data spanning 2005 to 2016. It is composed of MOT test results, including information about
which cars receive which sorts of MOT failures.
Two datasets publicly available on the website Kaggle have been gathered for use in the project. One
dataset is related to the estimated price of the second hand car scanned[9]. The second dataset is
related more generally to the original pricing of the scanned car and its general features[10].
By using vehicle make and model recognition more widely, this system can augment the number
plate check which the DVLA advises. Saravi and Edirisinghe’s work on VMMR in CCTV footage[11]
uses regions of interest as a defining feature of how cars are identified, which is similar to how the
system proposed in this paper is expected to distinguish car models and makes.
In their paper comparing data mining tools, Hirudkar and Sherekar concluded that Weka is “highly
robust for a variety of users”, which prompted its usage in this project[12]. Orange was also covered
in their work; it can “run various types of statistical tests and analyses and create charts and graphs
for the results”[13], and was therefore chosen as the second software to trial for this project.
Lee H.J. noted in their paper that model and manufacturer identification needed further research,
while licence plate recognition had already been attempted by many[14]. Their suggestion was a
three-layer back-propagation network for scanning number plates. In Rob Hull’s article for This is
Money, he noted the DVSA’s report that 1 in 13 used vehicles have a pending safety recall[15].
4.2.1 - Orange
One software suite for artificial intelligence is Orange, a Python-based suite. This project’s end goal
and results will first be attempted using this software, primarily its image recognition neural
network features. The method for producing image classification is straightforward: Orange’s user
interface is intuitive, using visual widgets that can be linked to each other with lines to symbolise
data flow.
To do so, the “Import Images” widget is selected and moved into the operating area where
components can be linked. From there, this widget is edited to select the folder containing the
images to be used. The second widget, “Image Embedding”, is then added; it runs the pictures
through a neural network, extrapolating data from the images. The two widgets are linked, then the
embedded images are linked to “Test & Score”, which evaluates the images against their
classifications based on this embedded data. A further widget from the “Model” group, such as
“Logistic Regression”, is added to the “Test & Score” widget. This acts as the learner that “Test &
Score” uses to evaluate the images. Finally, a “Confusion Matrix” widget is attached to “Test &
Score”, so that accuracy evaluation may be performed after the fact.
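Although Orange is operated through its GUI, the equivalent train-and-score loop can be sketched in plain Python. The snippet below is a minimal, hypothetical stand-in using scikit-learn rather than Orange’s own API: the “Image Embedding” step is approximated by a pre-computed feature matrix X (here random data so the example runs), and “Test & Score” with a Logistic Regression learner corresponds to cross-validated scoring.

```python
# Minimal sketch of the "Test & Score" + "Logistic Regression" flow.
# X stands in for the image embeddings Orange computes remotely; here it is
# random data purely so the example runs end to end.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))     # stand-in for 200 image embeddings
y = rng.integers(0, 4, size=200)    # stand-in for 4 vehicle classes

learner = LogisticRegression(max_iter=1000)
scores = cross_val_score(learner, X, y, cv=10)  # 10-fold cross-validation
print("CA per fold:", scores.round(3), "mean CA:", round(scores.mean(), 3))
```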
Figure 1 - Example Orange workflow
4.2.2 - Weka
Weka as a software is centred towards data mining much more than it is towards image classification
and object recognition[16]. To use image classification features, the package for doing so needs to
be added on separately. The method for using images requires all the images to be within the same
folder. An ARFF file needs to be generated, which includes each image’s file name and the
classification corresponding to it. These two attributes are required to progress further. By applying
image filters to each image, additional numeric attributes can be extracted from the images.
More than one filter may be applied in succession to yield further image attributes generated. This
may increase or decrease the accuracy of the classification depending on the data the software is
trying to classify from.
The settings can be found towards the top of the screen, along the menu bar. The nodes, the links
between them and other preferences can be edited here. Add-ons may also be installed, which
extend the functionality of Orange. Output logging can be extended to the level of debugging, so the
dataflow can be observed in varying levels of detail. Helpful examples and YouTube tutorials can also
be found in the “Help” section of the interface.
The five applications that can be selected on the right-hand side of Weka’s main window are varied,
including an “Experimenter” and a “Workbench”. The “Explorer” is the first of the five applications
displayed. It is a window which includes the ability to open files, databases and URL links to data.
Filters may be applied to extrapolate additional data for analysis from the imported data, including
filters for both supervised and unsupervised purposes.
The “Explorer” also includes additional tabs towards the top of the window. These include methods
to specify classifiers, cross-validation folds and whether to reserve a certain amount of data for
training. Clusterers and associators can also be used, which can similarly be split between training
data and test data. Attribute selection can be performed with attribute evaluators and search
method selection, such as subset evaluation and best-first searching. Finally, visualisation tools are
available from this application, allowing plot matrices to be calculated and displayed with
colouration depending on the class chosen to be represented.
“Experimenter” is the second application displayed in the splash menu. A variety of experiment
types are available to be performed, such as cross-validation. The output for these experiments can
be selected, as can the output format, such as an ARFF file. Iteration control can be applied, so that
the experiment is carried out a certain number of times before the software completes its goal.
Datasets and algorithms are available to be added, loaded and saved within the setup interface,
allowing for very specific targeting of certain processes for specific portions of data. Once the
experiment has been set up, the experiment can be run under those parameters within the “run”
tab. Additionally, in the “analyse” tab, specific sources can be imported for testing. These tests can
be configured with a high degree of specificity, including selecting certain rows and columns, the
field to be compared and the significance of that field. The output format can be changed, and
standard deviations and certain columns can be highlighted in the output, which is given in a clearly
laid out window to the right of the application.
The “KnowledgeFlow” application is laid out in a more list-like format than the previous applications.
It contains an interface into which nodes can be inserted from the left-side list of categories. Various
types of nodes may be inserted, such as data sources like “ArffLoader”, which loads ARFF files.
Further, classifier, clusterer and evaluation nodes can be loaded into the workspace. By
double-clicking nodes it is possible to configure them individually to a surprising degree of detail.
“Simple CLI” opens up a command-line-style window, which at the time of writing has 10
documented commands. The main usages seem to be listing the capabilities of Java classes,
managing variables, and executing and killing jobs.
“Workbench” seems to incorporate all the above applications into one window. A tab-style system is
placed at the top of the application, from which any of the aforementioned applications can be
switched to within the window, such as “Simple CLI”.
The all-in-one design of Orange is useful in that all the processes being undertaken are clearly visible
on one screen, which may make usage easier for a beginner than other alternatives. Weka itself has
more functionality, but it is spread across more applications and screens than Orange’s.
The linking process between nodes in Orange is very intuitive, with a simple click-and-drag system at
the centre of how the various components of a data mining workflow are connected. Compared to
Weka, Orange’s ability to clearly show what data is being transferred by connections is also
incredibly useful for realising when data needs to be interpreted or converted before usage.
Weka appears to be a fully-equipped data mining tool without any further additions, whereas
Orange’s add-on system is much more heavily used than Weka’s. This means, on the other hand,
that Orange can potentially be tailored for purpose with less unused functionality than Weka.
Orange contains a number of visualisation tools, such as graphs and plots, and more can be
imported via add-ons. Numerically, though, Weka seems to have the advantage in terms of
visualisation of data and analysis of results after the fact; Weka’s confusion matrix, for example,
contains a greater amount of content than Orange’s counterpart.
Orange and Weka were the software chosen for evaluation. By browsing video tutorials and online
resources, progress has been made towards creating a pair of neural networks which can then be
selected between. This has been easier with Orange than with Weka, as Weka is a much less reliable
piece of software compared to Orange’s newer and better-supported product.
The project has encountered few difficulties. One notable problem was the lack of research and
papers applying this technology to real-life use cases, at least in the field of image recognition. This
rendered the literature review somewhat difficult to complete, as it almost entirely consists of a
“gap analysis” and mentions of inspiration from somewhat-similar papers.
Additionally, the datasets procured are generally out of date, as many datasets held by the DVLA and
other bodies require several thousand pounds to purchase.
Windows itself as an operating system posed several problems during the execution of this project.
MATLAB’s .mat file extension is identified by Windows as a Microsoft Access shortcut. This was an
issue when attempting to open and evaluate classification information related to the training
images: Microsoft Access attempts to open the data, cannot do so, then proceeds to show a new
Access project. This was rectified by installing MATLAB.
As of now, there have not been many notable deviations from the original plan. However, doubts
about the viability of Weka, given its age relative to newer technologies, have caused the balance of
prototype development to shift in favour of Orange.
To test Orange on the full set of car images, it was first necessary to see how Orange would handle
so much data. To do so, a simple hierarchical clustering workflow was implemented to check the
utility of the software. This was in an effort to benchmark the performance of Orange as a program
before proceeding to create the classifiers required.
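Orange performs this clustering through its widgets, but the benchmark can be approximated in plain Python. The sketch below is a hypothetical stand-in using SciPy’s agglomerative clustering on pre-computed image feature vectors; the file name “embeddings.npy” is an assumption for illustration.

```python
# Hierarchical (agglomerative) clustering over image feature vectors,
# cut into 196 clusters to mirror the 196 vehicle classes.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.load("embeddings.npy")        # (n_images, n_features) feature matrix
Z = linkage(X, method="ward")        # build the cluster tree bottom-up
labels = fcluster(Z, t=196, criterion="maxclust")
print(np.bincount(labels)[1:])       # cluster sizes, to see how images group
```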
The main issue perceived in using Orange was that the program did not save computed progress:
every time a project containing the “Import Images” widget was opened, the entire workflow would
be recalculated. This was an issue because, in the later stages of the project, when the entire set of
16,185 images was being imported and computed, this could take up to around 18 hours.
Orange calculates such embeddings by sending the relevant data to its servers, which then remotely
generate the image embeddings. Similarly, the models are trained remotely, not on the host PC. This
was an issue, as the PC being used for this project was a top-grade consumer computer; had the
computations been run locally, the embeddings and the subsequent training could have completed
drastically faster by cutting out communication time.
Figure 3 - Hierarchical clustering test workflow
The hierarchical algorithm completed without much issue, yielding some useful insights into how
similar and how varied the car images are. That said, it clustered with low accuracy, given that the
algorithm in question was not properly specialised for clustering such varied images. Differences in
lighting and positioning, and the subtle differences between types of car from the same
manufacturer, all cause the accuracy to suffer.
It was around this point that it became clear that the source data was varied enough that, if the
resultant models achieved high accuracy, they could be used under a wide variety of environmental
conditions and variations. This subsequently hints that the models developed bear significant
real-life utility as opposed to only theoretical accuracy.
Orange derives the appropriate classes for the classifier it is training from the file structure. This
contrasts with other programs such as Weka, which often use configuration files to do so.
In this case, the 16,185 unsorted images within the dataset needed to be sorted into a file structure
with 196 child folders, one for each class. From there, the entire root folder can be selected and
Orange can determine which classes to train the neural network on and which predictions are
correct. This presented an issue, as the source images were not pre-sorted upon download.
To do so, the annotations recording which image belongs to which class were required. From this
data it would be possible to re-sort the source data, image by image, into the appropriate directory.
The issue was that the annotations were stored entirely within MATLAB files, which is not a format
that facilitates programmatically reading and reorganising many tens of thousands of images.
5.2.1 - Formatting MATLAB to CSV
It was now the case that the MATLAB format needed to be converted into a more code-friendly
format. This was relatively simple, requiring that the annotations be opened within MATLAB and
then saved as .csv files. This meant the files could be opened within Excel to be viewed more
conveniently, and imported for coding purposes.
However, the way the MATLAB annotations converted to an Excel-readable format was not perfect
or suitable for iterating through in a logical manner. Therefore, it was necessary to perform some
manipulation of the table to sort the information into a more optimal format. Additionally, the class
names themselves were stored separately from the actual pictures they related to; instead, a second
file assigned a class number to each picture. For example, “000001.jpg” is associated with class “1”
rather than directly with the name of the class itself.
Further, the file containing the string names of the classes was structured as a single row in the
Excel file, containing all the class names cell by cell. An example of this can be seen below. The main
concern with this was that it would make iterating through these names, once again, more difficult
than it needed to be. Transposing the data to a single-column format, similar to how the
image-to-class file is structured, made the required sorting algorithm easier to implement.
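Such a transposition is a one-line operation in pandas; the sketch below is a hypothetical reconstruction, with the file names being assumptions for illustration.

```python
# Transpose a single-row CSV of class names into a single-column CSV.
import pandas as pd

row = pd.read_csv("class_names.csv", header=None)  # one row, one name per cell
row.T.to_csv("class_names_column.csv", header=False, index=False)
```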
Figure 4 - File Name to class number CSV
Figure 6 - Translated class name spreadsheet
Figure 8 - Unaltered car image names Figure 9 - Altered car image names
This performed well for such a simple solution, easily cutting out the irregular number of leading
zeros which could have proved problematic to account for in a simple iterative program sorting
these thousands of pictures into the subfolders necessary for the Orange classification process.
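A renaming pass of this kind takes only a few lines; the following is a minimal sketch, assuming the images sit in a single folder (the folder name is an assumption).

```python
# Strip leading zeros from image file names, e.g. "000001.jpg" -> "1.jpg".
import os

folder = "car_ims"  # assumed folder name
for name in os.listdir(folder):
    stem, ext = os.path.splitext(name)
    new_name = stem.lstrip("0") + ext
    if new_name != name:
        os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
```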
However, there were some issues with coding this iterative algorithm. As seen below, several
problems arose regarding errors in the dataflow. One of these concerned the Python method of
transferring a file between folders: the os module must be used, along with a specific way of
designating the directory, as shown below. This caused a slight slowdown in development as it was
worked through and completed.
Furthermore, several other issues presented themselves. As visible below, the console log shows
that the code failed to recognise when the next directory should be created, i.e. when all images of
the previous class have been sorted. The feedback message for the first directory being created is
received, but the code then iterates through only the first 196 images, believing that each picture
should be its own class. This was rectified by fixing the lookup section of the code, where the
image’s class is checked before the program assumes it should cycle to the next class. This prevented
the mismatch of false classification and the creation of only one subfolder, as seen.
Figure 12 - Code to divide car images into subfolders by class failing to create multiple subfolders
At this point the algorithm was corrected further, as seen below. In addition, as seen in the figures,
the filenames have been corrected to remove spaces and insert underscores. This assists with
creating the necessary subfolders, as Windows does not always handle spaces within directory
names reliably. There was another small issue in the naming convention of some of the classes:
certain classes of car included forward slashes and backslashes in their names. This posed
difficulties for composing the string name for the folders correctly, and needed to be corrected
manually in several areas of the naming CSV file.
Figure 13 - Working code to divide car images into subfolders by class
Figure 14 is a snapshot of the end product: the code used to sort the photos correctly into the 196
separate classes. By fixing the way the addressing string was composed, it was possible to rectify the
issue of only one directory being made in the previous version of the code. The results of this code
can also be seen in Figure 14: the algorithm successfully sorted many thousands of images into
unique subfolders, according to the categorisation of the two CSV files.
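As Figure 14 is presented as a screenshot, a hypothetical reconstruction of such a sorting pass is sketched below; the CSV names, column layout and folder paths are illustrative assumptions rather than the project’s exact code.

```python
# Sort car images into one subfolder per class, driven by two CSVs:
#   image_classes.csv : filename,class_number  (e.g. "1.jpg,14")
#   class_names.csv   : class_number,class_name
import csv
import os
import shutil

def safe_name(name):
    # Format a class name for use as a Windows folder name.
    return name.replace(" ", "_").replace("/", "").replace("\\", "")

with open("class_names.csv", newline="") as f:
    class_names = {int(num): safe_name(cls) for num, cls in csv.reader(f)}

with open("image_classes.csv", newline="") as f:
    for filename, class_num in csv.reader(f):
        target = os.path.join("sorted", class_names[int(class_num)])
        os.makedirs(target, exist_ok=True)  # create each class folder once
        shutil.move(os.path.join("car_ims", filename), target)
```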
Figure 14 - Final code to divide car images into subfolders by class with output log
Figure 15 - Car images divided into subfolders by class
As seen below, Weka required a very particular layout within the file. The “@relation” declaration is
required within the ARFF for the configuration to be loaded correctly; it is visible at the top as
“car_ims_master”. Secondly, the “filename” attribute needed to be declared as a string, so that
Weka knew which files within the chosen directory to ingest and had unique identifiers to relate to
the assigned classes.
The “@data” section is arguably the most important part of the ARFF file. It is where the previously
defined attributes are related to each other, allowing the artificial intelligence algorithms to
accurately calculate their accuracy. As can be seen, the data has to be structured in a very specific
way: first, the full filename with extension is given, so that Weka knows which file corresponds to
which associated data; afterwards, the classification of that data is added.
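A minimal ARFF of the shape described is sketched below; the three class names are illustrative stand-ins for the full set of 196.

```
@relation car_ims_master

@attribute filename string
@attribute class {AM_Hummer, Acura_RL_Sedan_2012, Acura_TL_Sedan_2012}

@data
1.png, AM_Hummer
2.png, Acura_RL_Sedan_2012
3.png, Acura_TL_Sedan_2012
```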
Figure 16 - Custom ARFF for the car images
Furthermore, there were several instances where the naming conventions of the Acura classes,
when formatted as seen within the ARFF file, caused conflicting identical class names. This needed
to be rectified by singling out the problem classes and renaming them. These renaming changes
were then migrated across to all the necessary settings in Weka.
Similarly, Weka would not accept classes with erroneous characters, so characters such as the slash
in “C/V” needed to be replaced, giving “CV”, along with several other instances of special characters
used within the class names.
Figure 17 - Code for constructing custom ARFF file
The program for extracting, formatting and renaming the data to construct the ARFF file is visible in
Figure 17. Figure 18 shows the resulting raw input, with the classes correctly added to the necessary
image names.
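As Figure 17 is likewise a screenshot, a hypothetical version of such an ARFF generator is sketched below; the input CSV name, its two-column layout and the sanitising rules are assumptions based on the issues described above.

```python
# Build the "car_ims_master" ARFF from a filename -> class-name CSV.
import csv

def sanitise(name):
    # Mirror the renaming described above: no spaces, slashes or backslashes.
    return name.replace(" ", "_").replace("/", "").replace("\\", "")

with open("image_classes_named.csv", newline="") as f:
    rows = [(filename, sanitise(cls)) for filename, cls in csv.reader(f)]

classes = sorted({cls for _, cls in rows})
with open("car_ims_master.arff", "w") as out:
    out.write("@relation car_ims_master\n\n")
    out.write("@attribute filename string\n")
    out.write("@attribute class {%s}\n\n" % ",".join(classes))
    out.write("@data\n")
    for filename, cls in rows:
        out.write("%s,%s\n" % (filename, cls))
```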
Figure 18 - Final ARFF file, having converted JPG to PNG
Figure 19 - Importing the Image Filters package into Weka
One of the major issues encountered was that, when the ARFF file was loaded into the program, an
error regarding incorrect colour spaces and rasters would show. This caused many issues, as it is not
a Weka-native error; the reports of similar errors online related to specific errors in Python. As
changing the source code of Weka was definitely outside the scope of the project, it was clear that a
notable change to the source data was needed to attempt to solve this error.
After consulting with experts in Weka, it was decided that the source data needed to be converted
to another colour space. As an elegant solution to this, a Python script was written to cycle through
all of the pictures and convert them from JPEG to PNG format.
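A minimal sketch of such a conversion pass using the Pillow library is shown below; the folder names are assumptions for illustration.

```python
# Convert every JPEG in a folder to PNG, normalising the colour space to RGB.
import os
from PIL import Image

os.makedirs("car_ims_png", exist_ok=True)
for name in os.listdir("car_ims"):
    if name.lower().endswith((".jpg", ".jpeg")):
        img = Image.open(os.path.join("car_ims", name)).convert("RGB")
        img.save(os.path.join("car_ims_png", os.path.splitext(name)[0] + ".png"))
```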
This worked to solve the error, allowing the data to be imported into Weka, as seen within Figure 21.
The filename and class attributes are clearly visible, along with the 16,185 instances of various car
pictures. From here, it was possible to move further ahead and begin to run filters to extrapolate
data for Weka to classify upon.
Figure 21 - Weka interface after loading ARFF, with filename and class visible
To navigate to the filters which can be run on images, the “Choose” button must be pressed and the
desired filter selected from the list. As the images are not in a standard format by which they can be
processed, only the special filters designed for this purpose from the Image Filters plugin can be
used. This can be seen below: there is a large list of filters under “unsupervised”, and under the
“instance” subset of unsupervised algorithms.
As can be seen within Figure 23, each of the 196 distinct classes has been successfully imported. The
figure also shows that a suitable filter has been chosen, along with the directory set to that of the
unsorted PNG images folder.
Figure 23 - Visual graph of the 196 classes
Figure 24 - The additional image features extracted from the filter used
Figure 25 - First workflow designed and used
There were several issues regarding the Figure 25 workflow. Primarily, by training all the models at
once, the system required a huge amount of computing resources and time to complete. To this
end, the workflow was split into subsections, one of which is visible within Figure 26. Additionally,
the source data was cut down to fit within a 16GB RAM capacity. This posed a few problems: having
trained the models on the whole dataset, Orange cannot Test & Score the models after the fact, for
example when loading a model for predictions. Therefore, for the purposes of testing the efficiency
of the algorithms, it was only possible to evaluate the models against each other during the training
workflow. This means that the actual accuracy of the fully trained models could not be documented;
however, models trained using a quarter of the data were created to compare the relative
accuracies and capabilities of each of the types of model trained and used.
Figure 26 - Cut down training/testing workflow for performance improvement
The full workflow used for training and debugging the original models, prior to the restructuring and
removal of unnecessary parts that resulted in Figure 26, can be seen below. The data flowing
through the workflow was saved at regular intervals during development, in case it was required for
backend comparisons and processing. Ultimately, there was simply too much data saved from these
stages to process into meaningful statistics that are not already included in this project.
Figure 27 - Full diagnostic and model saving workflow
Here we move on to the basic building block from which the actual end-user predictions on user
images are implemented. By loading a model and passing the embedded image data into
Predictions, each class is assigned the percentage of the time that the image is classified as that
class. Additionally, the class in which the model is most confident is output as the primary predicted
class.
Figure 28 - Prediction mechanism using trained models on inputted car image data
Figure 29 is the extrapolation of this, implementing the technology in a manner that end users may
find useful. Useful information is loaded into the workflow, such as miles per gallon, MSRP and
other statistics. From there, the data is merged and concatenated into the table output from the
prediction, which includes the predicted class. This is done by cross-checking the predicted class
against the names of the cars in the imported dataset. Finally, the table is stripped down, removing
all the features that came from the feature embedding and any other data which is not useful.
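This merge-then-strip step corresponds to a standard table join. A hypothetical pandas equivalent is sketched below; the column and file names are assumptions, not the actual fields of the Kaggle datasets.

```python
# Join the predicted class onto an external car-statistics table, then keep
# only the columns useful to the end user. All names are illustrative.
import pandas as pd

preds = pd.DataFrame({"predicted_class": ["AM_Hummer"]})
specs = pd.read_csv("car_features_msrp.csv")  # assumed export of dataset [10]

merged = preds.merge(specs, left_on="predicted_class", right_on="model_name")
print(merged[["predicted_class", "msrp", "highway_mpg"]])
```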
The formation of the workflow took a lot of trial and error to complete. The method of merging the
data took a good amount of manual work, using Excel to concatenate and strip data to a field which
could be compared exactly to the predicted class. Additionally, making sure the useful data that was
concatenated to the table wasn’t stripped out was a challenge, as there was little way to discern
how to do this other than trying all the various options available within the widget. Furthermore,
the output needs to be filtered to include only the information for the class which was predicted;
shown below is one of the steps needed for such processing.
Figure 30 - Tool to select only the data for the image entered into the system
The inputs and outputs of the system are uniform and logical. The input in its rawest form is the
>16,000 unsorted car images inside one folder. This is then manipulated by a piece of
custom-written Python code, which strips the unnecessary leading zeros from each of the images’
names. Once this has completed, another piece of custom code loads two Excel files, scanning
through the one which holds the filenames of each image. The class number associated with that
image is then cross-referenced with the other Excel file which holds the association of class numbers
to class names.
The code then creates a subfolder, with a name extracted from the aforementioned Excel file. The
name within the Excel file for the given class is formatted for use in Windows folder names, such as
removing spaces and replacing them with underscores. Once the folder is created, the code will
continue iterating through the list of filenames for each image and their associated class numbers.
All images will be sorted into the folder which has the title that is the same as their class name.
Once sorted into their respective subfolders, the whole root folder is passed into Orange, which uses
the subfolder directories to denote classes within its framework. This is what the various models are
trained and tested upon, after the images are individually passed through an embedding widget
which extracts additional features to use within the classifier. Given the processing power available,
dividing the full set of images into quarters was the best way to give accurate testing for the classifiers.
The output from this process is four models (Logistic Regression, Support Vector Machine, Neural
Network and Naive Bayes) for each quarter of the data.
These models can then be evaluated against each other, using the confusion matrices and other
metrics noted below, to come to a decision on which model is best for real-world usage. To run
predictions, the image to be classified is run through the embedding component to extract features
to classify against. The model to classify with is loaded into the predictions component; the output
from this portion of the process is the class which the model believes the image belongs to, or each
image belongs to if multiple are loaded.
Once this class is decided, the class name is passed forward to a component which merges that
information with the loaded dataset, associating the class with its data. This is then further input to
a component which removes all the unnecessary cells that were merged from the loaded dataset.
The result is a single row of data for each car image input to the system as a whole. This row
includes not only the classification of the image but also any useful information pertaining to that
class from the dataset.
As for the metrics noted in the tables within this section of this paper, the abbreviations are as
follows. AUC is the Area under ROC, the area under the receiver-operating curve. CA is Classification
Accuracy, the proportion of correctly classified examples. Precision is the proportion of true
positives among all instances predicted as positive, which measures false positives. Recall is the
proportion of true positives among all positive instances in the data. F1 is a weighted harmonic
mean of precision and recall.
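Expressed in terms of true positives (TP), false positives (FP), false negatives (FN) and the total number of instances N, these metrics are computed as:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
CA = (number of correct classifications) / N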
In particular, the confusion matrix is the analytical metric that is most used within this research. It
shows the percentage of misclassifications for any given class across all classes. This is useful for
deducing which classes are most prone to being misclassified and which classes are the ones that
cause this misclassification.
5.5.1 - Results of the first quarter of the data
As seen here, the most accurate model is Logistic Regression; conversely, Naive Bayes is the least
accurate of the models. The difference in accuracy between them is around 20%, as derived from
the below metrics of CA (Classification Accuracy), Precision and Recall. The maximum accuracy of
any model is that of Logistic Regression, which for the first quarter of the data is around 53%.
Relative to this, the Naive Bayes solution is nearly 35% worse a solution than Logistic Regression.
Figure 31 - Accuracy and metrics table for the first quarter of the car images
Figure 32 - Confusion matrix of neural network model for the first quarter of the car images
Figure 33 - Confusion matrix of support vector machine model for the first quarter of the car images
Figure 34 - Confusion matrix of logistic regression model for the first quarter of the car images
There are some very interesting findings to be extracted from the confusion matrices in Figures 32,
33 and 34, ordered as: Neural Network, Support Vector Machine, Logistic Regression. It was not
possible within Orange to calculate a confusion matrix for Naive Bayes. Across all results, it is clear
that cars of the same make are generally difficult to distinguish from each other. Conversely, cars
which are distinctive and have no other cars of the same brand within the model’s trained classes
approach total certainty of prediction.
A great example of this is the AM Hummer, which has a very distinctive shape with no cars of similar
make within the model’s classes. Therefore, as visible in the above tables, its accuracy is near 100%,
with only 1.1% uncertainty. Conversely, the Acura RL Sedan 2012, Acura TL Sedan 2012 and Acura
TSX Sedan 2012 show very prominent scattering within the confusion matrix. Ideally, the names of
these would be more distinguishable in the tables; however, Orange’s scaling limited the names as
seen.
Specifically, the fifth Acura model down the table shows this principle of how similar models within
the classifier can cause issues. As can be seen, the third Acura model listed confuses the
classification of the fifth, representing nearly 20% of that model’s misclassifications. Furthermore,
the second Acura model accounts for over 6% of misclassifications and the sixth represents nearly
5%. This totals around 30% of the otherwise very scattered misclassifications.
5.5.2 - Results of the second quarter of the data
Figure 35 - Accuracy and metrics table for the second quarter of the car images
The findings from the first quarter are backed up once more by the results from the second quarter
of the data. The best solution is clearly Logistic Regression, the worst being Naive Bayes. The
matrices below are in the order given in Figure 35.
Figure 36 - Confusion matrix of support vector machine model for the second quarter of the car images
Figure 37 - Confusion matrix of neural network model for the second quarter of the car images
Figure 38 - Confusion matrix of naive bayes model for the second quarter of the car images
Figure 39 - Confusion matrix of logistic regression model for the second quarter of the car images
Similarly to the first quarter of usable data, this is a small slice of the around 50 classes trained on
each quarter of the data. Much as we saw from the results of the first quarter, the models are
clearly able to correctly classify the car’s model and make regardless of the weather conditions,
lighting, colour and angle of the vehicle. There are some notable cases within these matrices where
vehicle models that appear very similar to other models cause misclassification. However, so far this
has not caused more misclassification into a certain incorrect class than into the correct class. The
most prominent example is in the logistic regression matrix (Figure 39), where the first Chevrolet in
the table is misclassified as the second Cadillac almost half as often as it is correctly classified.
5.5.3 - Results of the third quarter of the data
Figure 40 - Accuracy and metrics table for the third quarter of the car images
The first half of the data showed that Logistic Regression is the most accurate of the models, so for
the final half its matrix will be the primary focus. In the case of this third quarter, of the two most
accurate models, Logistic Regression and the Neural Network, Logistic Regression yields a model
~5% more accurate than the Neural Network. The gap to the Naive Bayes model is much greater:
Logistic Regression is ~23% more accurate, classifying correctly nearly 50% more often than Naive
Bayes.
Figure 41 - Confusion matrix of logistic regression model for the third quarter of the car images
The confusion matrix in Figure 41 is a great example of how more distinctive car shapes allow easier,
more accurate classification than more generic silhouettes. The Fiat 500 and most Ferrari models are
very distinct from the majority of the other cars classified within this third quarter of the car image
data. Even with three Ferrari types within the model, the scattering that would have occurred with a
less distinctive model did not occur.
Ordinarily, one or several percent of the total misclassification instances would be scattered across a
number of classes. With a very distinctive visual appearance for a type of car, this does not occur
nearly as much. An example within Figure 41 is the Fisker, which has only 5.7% misclassification
within the sample of six classes chosen for this confusion matrix, this error being spread across just
three of them.
5.5.4 - Results of the fourth quarter of the data
Figure 42 - Accuracy and metrics table for the fourth quarter of the car images
The fourth quarter of the data shows a very large anomaly compared to the other three quarters.
The Naive Bayes model has not handled this quarter well, showing a drop of 40% in accuracy
compared to the third quarter, to under 10%. This is an outlier among the four quarters of the car
images, which shows the value of analysing the data in sections rather than as a whole.
Figure 43 - Confusion matrix of logistic regression model for the fourth quarter of the car images
It is clear from the confusion matrices in this section that there are trends within the data. If a car
looks very similar to other types of car, there will be a large error towards those classes, reducing
the total proportion of correct classifications. Another trend identified within the data is that having
a large number of very varied classes, such as the ~50 classes per quarter, results in a notable
cumulative error. In many cases, a few percent of the total errors will be dispersed across a number
of classes, so the error increases with the number of classes within the model.
5.6 - Weka Classification
I believe the process depicted within this research can be of use to academics attempting to use
Weka for image analysis. The methods explored within this research can be used to counter errors
that may present themselves from the usage of Weka in this regard, saving time otherwise
unnecessarily spent correcting such errors.
6 - Conclusion
The research conclusion is that it is possible to create a solution within Orange which allows a car
image to be analysed, a prediction of the make and model of that car to be generated with
reasonable accuracy, and this prediction to then be associated with external datasets. In particular,
given the comparison in section 5.5, a Logistic Regression solution is the most apt for real-life usage.
This can be data that may influence product purchasing, such as MSRP and MPG of the car type
predicted. For end users, this process only takes a number of seconds and can be used with
consistent accuracy.
Ultimately, below is an example of real output from the working Orange system. The amount of
data added is much greater than displayed; however, it would be infeasible to include it all as a
figure, as it would become unreadable.
This system can legitimately help many people in the process of purchasing vehicles in person from
less-than-reputable sources. The benefit of running a picture through this Python-based workflow
and having the system return useful information regarding that car within a few seconds is tangible.
The development of this solution required a large amount of independent research, including
learning and understanding two separate data mining frameworks, Weka and Orange. The niches
and specific knowledge needed to properly utilise both systems required extensive testing and
iterative correction of solutions.
The flow of data through the system is as follows. More than 16,000 unsorted car images form the
source data, which is manipulated by a piece of custom-written Python code that removes the
leading zeros from each image’s name. When completed, a second custom program loads two Excel
files, iterating across the first spreadsheet, which contains the filenames of each image. The class
number related to each image is checked against the other Excel file, containing the relationships of
class numbers to class names.
This program creates a subfolder for each class name found within the relationships Excel file. Each
class name is converted to the Windows folder naming convention, for example by removing spaces
and replacing them with underscores. The car images are then sorted into the folder whose title is
equivalent to their class name.
When sorted into their respective subdirectories, the root folder is loaded into Orange, using the
subfolder directories to separate the contained images into classes within its framework. The various
models are trained and tested upon this, once the images are passed one at a time through an
embedding widget which filters and adds additional features for use within the classifier.
Orange’s system of work requires data submission to Orange’s servers; therefore, dividing the full
set of images into quarters was the best way to give accurate testing for the classifiers. Were this
not the case, submitting more than 16,000 images to Orange’s servers at once would have caused
issues, as encountered during the preliminary stages of using the framework.
Ultimately, four models (Logistic Regression, Support Vector Machine, Neural Network and Naive
Bayes) are created for each quarter of the data. The models are then evaluated using the confusion
matrices and other metrics to determine the most effective model for real-world end-user usage.
Four models were created for each quarter of the car image data, for a total of 16 models, each
individually analysed within section 5.5. Overall, the ranking of the four model archetypes was:
Logistic Regression, Support Vector Machine, Neural Network and Naive Bayes. The accuracy in
some cases was near 99%. However, the main issue identified with the models within 5.5 was that
visually similar cars, most notably of the same make or general model, caused nearly all of the error.
This is understandable, as even human experts may sometimes be unable to distinguish differing
models of car from the same manufacturer. Analysis of the error displayed by the confusion
matrices from testing shows that any given class is classified correctly at least twice as often as it is
classified as any single incorrect class.
To run predictions, the image to be classified is passed through the embedding component to
extract features to classify against. The chosen model is loaded into the predictions component,
whose output is the class the model believes the image belongs to, or each image belongs to if
multiple are loaded.
Once this class is predicted, the class name is merged with the information from the external
dataset that pertains to that class. This is done by concatenation of tables within Orange.
Unnecessary data that may have been merged from the dataset is subsequently removed, so that
only useful information is presented. The output is one record in the resultant table for every car
image that was input for prediction.
This research required several custom pieces of code to be written to interpret, preprocess and
order the data for ingestion by the machine learning frameworks. This demanded notable
proficiency with Python, executing systematic and sequential operations to ensure that the base
data of unordered car images could be processed into classifiers.
There are several areas in which the system could be further developed, such as translating the
Orange workflow into raw Python code for extended usability, adding other databases into the
system, or implementing the databases noted in the literature review into the workflow.
7 - References
[1] - The Sun. (2018, Feb. 19). Used car scam warning as rogue dealers use Facebook and Gumtree to
shift dodgy motors [Online]. Available:
https://www.thesun.co.uk/motors/5538653/used-car-scam-warning-as-rogue-dealers-use-facebook-and-gumtree-to-shift-dodgy-motors/
[2] - The Zebra. (2018, July. 3). Buying a Car Out of State: Do’s, Don’ts, and Paperwork You’ll Need
[Online]. Available:
https://www.thezebra.com/insurance-news/1946/buying-a-car-privately/
[3] - Express. (2018, Aug. 6). Car scams on the rise in the UK - Here’s how to avoid becoming a victim
of the crime [Online]. Available:
https://www.express.co.uk/life-style/cars/999650/car-scam-online-hack-advice-tips
[4] - Citizens Advice. (2015, Oct. 8). Buying a used car [Online]. Available:
https://www.citizensadvice.org.uk/consumer/buying-or-repairing-a-car/buying-a-used-car/
[5] - 3D Object Representations for Fine-Grained Categorization, Jonathan Krause, Michael Stark,
Jia Deng, Li Fei-Fei, 4th IEEE Workshop on 3D Representation and Recognition, at ICCV 2013
(3dRR-13). Sydney, Australia. Dec. 8, 2013.
[6] - Vehicle Safety Branch Car Recalls Database, data.gov.uk,
https://data.gov.uk/dataset/18c00cf3-3bb2-4930-b30d-78113113aaa7/vehicle-safety-branch-recalls-database, 2013.
[7] - Car Theft Index, data.gov.uk,
https://data.gov.uk/dataset/cd4cdeb8-a199-4aef-8158-e567d0a2ac5a/car-theft-index, 2010.
[8] - Anonymised MOT Tests and Results, data.gov.uk,
https://data.gov.uk/dataset/e3939ef8-30c7-4ca8-9c7c-ad9475cc9b2f/anonymised-mot-tests-and-results, 2017.
[9] - Second Hand Car Price Estimation, Xiao Jin,
https://www.kaggle.com/bahamutedean/secondhand-car-price-estimation, 2018.
[10] - Car Features and MSRP, Cooper Union,
https://www.kaggle.com/CooperUnion/cardataset, 2016.
[11] - S. Saravi and E. A. Edirisinghe, "Vehicle Make and Model Recognition in CCTV footage," 2013
18th International Conference on Digital Signal Processing (DSP), Fira, 2013, pp. 1-6.
[12] - Comparative Analysis of Data Mining Tools and Techniques for Evaluating Performance of
Database System, Arpita M. Hirudkar, Mrs. S. S. Sherekar
[13] - Orange, https://orange.biolab.si/, 2019
[14] - Lee H.J. (2006) Neural Network Approach to Identify Model of Vehicles. In: Wang J., Yi Z.,
Zurada J.M., Lu BL., Yin H. (eds) Advances in Neural Networks - ISNN 2006. ISNN 2006. Lecture Notes
in Computer Science, vol 3973. Springer, Berlin, Heidelberg
[15] - DVSA says 1 in 13 used cars have a punishable safety recall, Rob Hull,
https://www.thisismoney.co.uk/money/cars/article-5409173/DVSA-says-1-13-used-cars-punishable-safety-recall.html, 21 February 2018.
[16] - Weka, https://www.cs.waikato.ac.nz/ml/weka/, 2019
[17] - Jovic, Alan & Brkić, Karla & Bogunovic, N. (2014). An overview of free software tools for
general data mining. 1112-1117. 10.1109/MIPRO.2014.6859735.
[18] - WikiHow. (2019, Mar. 29). How to Increase Java Memory in Windows 7 [Online]. Available:
https://www.wikihow.com/Increase-Java-Memory-in-Windows-7
Appendix 1 - Provisional Project Contents
Contents
Abstract
1 - Introduction
1.1 - Aims
1.2 - Objectives
1.3 - Methodology
2 - Background
2.1 - General Analysis
2.2 - Motivation for undertaking the Project
3 - Literature Review
3.1 - Literature Review
3.1.1 - Dataset Collection
3.1.2 - Literature
3.2 - Techniques used by others
3.3 - Gathering of findings - never been done before
5 - Prototype Design
7 - Testing
9 - Conclusion
Acknowledgements
References
Appendices
Appendix 2 - Revised Work Plan