
BSc (Hons) Computer Science and Artificial Intelligence

18COC257

B625726

COMMERCIAL APPLICATIONS
OF VEHICLE IMAGE CLASSIFICATION
WITH DATA MINING

by

Ross D. Massie

Supervisor: Dr S. Saravi

Department of Computer Science


Loughborough University

April 2019
Contents

Abstract
Keywords
Acknowledgements

1 - Introduction
1.1 - Aims
1.2 - Objectives
1.3 - Methodology

2 - Background
2.1 - General Analysis
2.2 - Motivation for undertaking the Project

3 - Literature Review
3.1 - Material Review
3.1.1 - Datasets
3.1.2 - Literature and Academia
3.2 - Gathering of findings

4 - Comparison of Orange and Weka
4.1 - Properties and abilities of Orange and Weka
4.2 - Current implementation using Orange and Weka
4.2.1 - Orange
4.2.2 - Weka
4.3 - Visual interfaces
4.3.1 - Orange’s Visual Interface
4.3.2 - Weka’s Visual Interface
4.4 - Pros and Cons of Orange and Weka

5 - Progress and Challenges
5.1 - Orange General Testing
5.2 - Preparing Data for Orange Classification
5.2.1 - Formatting MATLAB to CSV
5.2.2 - Restructuring CSV
5.2.3 - Removing leading zeros
5.2.4 - Creating subfolders
5.3 - Preparing ARFF for Weka classification
5.3.1 - ARFF formatting issues
5.3.2 - Colour space issues
5.3.3 - Removing attributes and expanding memory allocation
5.4 - Running Orange Classification
5.5 - Comparison of Models
5.5.1 - Results of the first quarter of the data
5.5.2 - Results of the second quarter of the data
5.5.3 - Results of the third quarter of the data
5.5.4 - Results of the fourth quarter of the data
5.6 - Weka Classification

6 - Conclusion

7 - References

Appendix 1 - Provisional Project Contents

Appendix 2 - Revised Work Plan
Abstract
This research builds upon recent work on VMMR (Vehicle Make and Model Recognition) by
attempting to push these discoveries into the scope of commercial and practical applications. The
primary objective of this paper is to conceptualise, propose and implement a system which can be
utilised by an end user to perform safer automobile purchases. This is achieved by recognising a
car’s make and model from a picture, then gathering data pertaining to that car. This data is
analysed and presented to the end user as a set of metrics that provide a more accurate depiction
of the car’s value. The market for this system would be sites such as Facebook’s “Marketplace” and
other car trading platforms.

Keywords
VMMR - Vehicle Make and Model Recognition

FOI - Freedom of Information

MPG - Miles Per Gallon

MSRP - Manufacturer’s Suggested Retail Price

AUC - Area under ROC, the area under the receiver-operating curve

CA - Classification Accuracy, the proportion of correctly classified examples

Precision - The proportion of predicted positives that are true positives; a measure of false positives

Recall - The proportion of true positives identified amongst all positive instances in the data

F1 - A weighted harmonic mean of precision and recall

Confusion Matrix - A table which allows visualisation of the performance of an algorithm

Acknowledgements
While writing this dissertation, I have been fortunate to have much support on hand. Firstly, I would
like to thank my supervisor, Dr. S. Saravi, whose expertise was indispensable in the inception of the
research topic and choice of tools in particular.

In addition, I would like to thank my family for their wisdom and invaluable support; I couldn’t have
done this without you. Finally, there are my friends, whose constant presence was a great help in
ensuring I could complete the dissertation. Thank you all.

1 - Introduction
1.1 - Aims
The ultimate aim of this project is to create a system which can operate on a photo, recognising
what vehicle class is in the image. Once the class has been found, the system will search external
datasets, pulling back information on the vehicle that may be of use to the end user.

The system is intended for users buying cars in person, allowing them to near-immediately receive
a summary of several relevant statistics relating to that car and that type of car in general.

1.2 - Objectives
The objectives for this project are as follows:

1. Create project aims and objectives.


2. Gather datasets from the internet from sources such as the DVLA. Send FOI requests and
process answers.
3. Finalise the literature review on license plate recognition, car model classification and other
topics related to the project.

4. Analyse and learn to use Weka and Orange as platforms for image classification. Document
processes of how the software works and can be used in the pursuit of defined goals.

5. Produce solutions that fulfil the goal of the software, using Orange and Weka. Document
how these were produced.

6. Compare and contrast the two software packages to determine which is the most apt for the
project, including the pros and cons of each method.

7. Evaluate the findings of the project and whether the goals, aims and objectives have been
achieved.

1.3 - Methodology
This research study adopts a quantitative approach, emphasising a methodology that aims to solve a
real-world problem and to prove this via numerical results which can be evaluated. To support this,
research was conducted into similar publications on how image recognition of vehicle models is
achieved. This initial emphasis on secondary sources is appropriate, as much of the theory on how
to achieve recognition has already been established. Analysing these methods and building upon
them to associate mined data with the recognised cars constitutes the primary research. To this end,
various papers and journals have been referenced, and databases of use to this project noted.

2 - Background
2.1 - General Analysis
The problem addressed by this research originated from the emergence of the Facebook
marketplace and similar reasonably unregulated platforms for trading automobiles. It became clear
through a number of reports that a large proportion of cars sold on these platforms have been
involved in criminal activity or are otherwise scams.[1]

Additionally, people increasingly buy cars off the street or from acquaintances, given the potential
savings to be found in doing so. These methods are also far less safe than traditional trading
methods, as the innate trust we have in people we know makes it more likely that a person will buy
a car without performing the necessary checks.[2]

To that end, the average person utilising these platforms could be said to have a much higher chance
of being scammed than if they used traditional trading methods such as car showrooms. This
problem exists as an inevitability of the increasing interconnectivity that comes with the rise of
social media.[3]

Finally, one nuance of in-person purchases outside the realms of the internet is that there is
normally a restricted time frame in which to agree to buy a product. Furthermore, it is plausible
that being seen to openly question the validity of the product can cause the vendor to become more
aggressive. The solution therefore needs to be usable covertly and quickly, so that the investigation
into the validity of the product is not necessarily perceived by the vendor.[4]

2.2 - Motivation for undertaking the Project


This project was undertaken because it can be argued that, as a consequence of commercial
incentives, ignoring the origin of potentially illegal goods is beneficial to the throughput of a trading
platform. Due to this it is often – rather counterintuitively – not in the interests of traders to ensure
that their customers aren’t purchasing stolen goods or scam products. Providing a third-party utility
which can assist customers with verifying the real value of the product they are purchasing would
help to offset this.

3 - Literature Review
3.1 - Material Review

3.1.1 - Datasets
This project has utilised several databases to facilitate associating contained information with the
model of the scanned car. The first notable dataset, from Jonathan Krause et al[5], consists of 16,185
car images across 196 classes and will be used for training and testing the neural network itself. This
dataset will be the basis for initial training and testing of the system, as its content is substantial
enough to facilitate a high degree of accuracy.

There are also several datasets, which will be mined once the car’s model has been ascertained.
These will provide data regarding that model of car to the user, which may be useful in ensuring end
users will not become victims of scams.

One notable dataset is the Vehicle Safety Branch Recalls Database, provided by Data.gov[6]. This
dataset contains all the safety recalls issued by manufacturers as of 12 December 2013. This data will
be vital, as the DVSA has stated that 1 in 13 used cars has an outstanding safety recall. Additionally,
owning a car with such a recall could net a £2,500 fine. Buying such a car would therefore be very
ill-advised, and this danger must be accounted for in the developed system.

Another dataset is the Car Theft Index, provided by Data.gov[7]. The data was last updated on 9
February 2010, meaning it is likely of less use than the above datasets, though it will still be of some
utility. The dataset itself holds data from 2005 listing the cars which are most likely to be stolen.

The Anonymised MOT tests and results dataset, provided by Data.gov[8], is also to be incorporated
into the system. This will be a very interesting dataset to use, as the potential for mining is large,
with data spanning 2005 to 2016. It is composed of MOT test results, including information about
which cars receive which sorts of MOT failures.

Two datasets publicly available on the website Kaggle have been gathered for use in the project. One
relates to the estimated price of the scanned second-hand car[9]. The second relates more generally
to the original pricing of the scanned car and its general features[10].

3.1.2 - Literature and Academia


Moving on to the research literature surrounding the topic, the first paper referenced is the work of
Saravi and Edirisinghe on vehicle make and model recognition[11]. Their work is notable as their
proposed system yields over 95% classification accuracy. As they state, the current methods of
distinguishing a counterfeit vehicle are heavily reliant on the number plate, which can be forged.

By using vehicle make and model recognition more widely, this system can augment the simple
number-plate check which the DVLA advises. They write about regions of interest, used as a defining
feature of how the cars are identified, which is similar to how the system proposed in this paper is
expected to distinguish car models and makes.

In their paper comparing data mining tools, Hirudkar and Sherekar concluded that Weka was
“highly robust for a variety of users”, which prompted its usage in this project[12]. Orange was also
included in their work, and can “run various types of statistical tests and analyses and create charts
and graphs for the results”[13]. It was therefore chosen as the second software to trial for this
project.

Lee H.J. wrote of how model and manufacturer identification needed further research, and that
license plate recognition has been attempted by many[14]. Their suggestion was a three-layer
back-propagation network for the scanning of number plates. In Rob Hull’s article for This is Money,
he noted how the DVSA reports that 1 in 13 used vehicles has a pending safety recall[15].

3.2 - Gathering of findings


The literature on the subject of relating datasets to previously classified vehicles was found to be
non-existent in this review. This paper’s objective is to close that gap by extending current research
on classifying vehicles, relating databases and datasets to the software’s result. Past research has
shown that systems which classify vehicle images into vehicle makes and models can be created
successfully.

4 - Comparison of Orange and Weka


4.1 - Properties and abilities of Orange and Weka
Orange[13] and Weka[16] are the two systems that will be used to develop a solution for this
research. As Arpita M. Hirudkar and Mrs. S. S. Sherekar stated in their paper[12], Orange is a
component-based system while Weka is an open-source collection of data mining tools. Jovic et al.
studied a number of data mining tools, among which Orange and Weka were both categorised as
having classification abilities[17].

4.2 - Current implementation using Orange and Weka

4.2.1 - Orange
One software suite for artificial intelligence is Orange, a Python-based suite. This project’s end goal
and results will first be attempted using this software, primarily its image recognition and neural
network features. The method for producing an image classification is straightforward: Orange’s
user interface is intuitive, using visual widgets that can be linked to each other with lines to
symbolise data flow.

To do so, the “Import Images” widget is selected and moved into the operating area where
components can be linked. This widget is then edited to select the folder containing the images to be
used. The second widget, “Image Embedding”, is then added; it runs the pictures through a neural
network, extracting features from the images. The two widgets are linked, then the embedded
images are linked to “Test & Score”, which will attempt to evaluate the images against their
classifications based on this embedded data. A further widget from the “Model” group, such as
“Logistic Regression”, is added to the “Test & Score” widget. This acts as the learner with which
“Test & Score” evaluates the images. Finally, a “Confusion Matrix” widget is attached to “Test &
Score”, so that accuracy and evaluation may be examined after the fact.

Figure 1 - Example Orange workflow
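Although the workflows in this project were built in Orange’s GUI, Orange also exposes the same
components through a Python scripting API. Below is a minimal sketch (not part of the project’s
actual code) of evaluating a learner over a data table, assuming the embedded image features have
already been exported to a hypothetical “embeddings.tab” file; note that the exact CrossValidation
call signature varies between Orange 3 releases.

```python
# Hedged sketch using Orange's scripting API (Orange 3). "embeddings.tab" is a
# hypothetical export of the image-embedding output: one row per image, with
# the class as the target column.
import Orange

data = Orange.data.Table("embeddings.tab")
learner = Orange.classification.LogisticRegressionLearner()

# Cross-validate; in some Orange 3 releases this is written as
# CrossValidation(k=10)(data, [learner]) instead.
results = Orange.evaluation.CrossValidation(data, [learner], k=10)
print("CA: ", Orange.evaluation.CA(results))   # classification accuracy
print("AUC:", Orange.evaluation.AUC(results))  # area under ROC
```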

4.2.2 - Weka
Weka as a software package is centred on data mining much more than on image classification and
object recognition[16]. To use image classification features, a package for doing so needs to be
installed separately. The method for using images requires all the images to be within the same
folder. An ARFF file needs to be generated, which includes each image’s file name and its
corresponding classification. These two attributes are required to progress further. By applying
image filters to each image, additional numeric attributes can be extracted from the images.

Weka’s image filters are stored under unsupervised/instance/imagefilter. A filter is selected, such as
“colourlayout”, and directed at the image directory. Once all the images have been read, the
features are extracted and added to the dataset. Using Weka’s “Classify” tab, a classifier can then
be selected, which will attempt to perform the image classification. An example used is J48.

More than one filter may be applied in succession to yield further image attributes. This may
increase or decrease the accuracy of the classification depending on the data the software is trying
to classify.

4.3 - Visual interfaces

4.3.1 - Orange’s Visual Interface


The visual interface of Orange focuses on a single window, into which nodes are added and linked.
The available nodes are arranged towards the left-hand side of the screen, which allows easy access
and visualisation of what is available for use. Once items from each group have been added to the
working area, they are coloured according to the category they originate from, for ease of
interpretation. Individual nodes can then be linked to each other by clicking and dragging lines
between them. This prompts text to attach to the line, presenting “Data -> Images” and similar
notes depending on what is being exchanged or converted between nodes. Annotations can be
added to the working space, which do not contribute to or interfere with processes on screen. These
are useful for following processes and noting future development micro-goals.

The settings can be found towards the top of the screen, along the menu bar, where nodes, the links
between them and other preferences can be edited. Add-ons may also be installed, which can extend
the functionality of Orange. Output logging can be raised to the level of debugging, so the dataflow
can be observed in varying levels of detail. Helpful examples and YouTube tutorials can also be
found in the “Help” section of the interface.

4.3.2 - Weka’s Visual Interface


Weka’s visual interface is a menu-style GUI, with several buttons to the right-hand side of the
main panel. Each opens a unique application for use with various data mining tasks. The primary
settings can be found in the software’s top menu, including diagnostic tools such as memory usage
and logging. Additionally, there is a section of the menu dedicated to visualisation, allowing the data
to be viewed in five separate ways, such as a tree or boundary graph. There is an accessible package
manager as well as an ARFF viewer and a SQL viewer. The “Help” section includes how-to guides
and code snippets to increase ease of use for beginners.

The five applications that can be selected on the right-hand side of the main window are varied,
including an “Experimenter” and a “Workbench”. The “Explorer” is the first of the five applications
displayed. It is a window which includes the ability to open files, databases and URL links to data.
Filters may be applied to extract additional data for analysis from the imported data, including
filters for both supervised and unsupervised purposes.

This “Explorer” also includes additional tabs towards the top of the window. These include methods
to specify classifiers, cross-validation folding and whether to reserve a certain amount of data for
training. Clusterers and associators can also be used, which can similarly be split between training
data and test data. Attribute selection can be performed with attribute evaluators and search-method
selection, such as subset evaluation and best-first searching. Finally, visualisation tools are available
from this application, allowing plot matrices to be calculated and displayed, coloured by the class
chosen to be represented.

“Experimenter” is the second application displayed in the splash menu. A variety of experiment
types are available, such as cross-validation. The output destination for these experiments can be
selected, as can the output format, such as an ARFF file. Iteration control can be applied, so that
the experiment is carried out a certain number of times before the software considers its goal
complete. Datasets and algorithms can be added, loaded and saved within the setup interface,
allowing very specific targeting of certain processes at specific portions of data. Once the
experiment has been set up, it can be run under those parameters within the “Run” tab.
Additionally, in the “Analyse” tab, specific sources can be imported for testing. These tests can be
configured with a high degree of precision, including certain rows and columns, the field to be
compared and the significance of that field. The output format can be changed, and standard
deviations and certain columns can be highlighted in the output, which is given in a clearly laid out
window to the right of the application.

The “KnowledgeFlow” application is laid out in a more list-like format than the previous applications.
It contains an interface into which nodes can be inserted from a left-side list of categories. Various
types of node may be inserted, such as data sources like “ArffLoader”, which loads ARFF files.
Further nodes, such as classifiers, clusterers and evaluation nodes, can be loaded into the
workspace. By double-clicking nodes it is possible to configure them individually to a surprising
degree of detail.

“Simple CLI” opens up a command-line-style window, which at the time of writing has 10 commands
documented for use. The main usages seem to be listing the capabilities of classes within Java,
managing variables, and executing and killing jobs.

“Workbench” incorporates all the above applications into one window. A tab-style system is
placed at the top of the application, from which any of the aforementioned applications, such as
“Simple CLI”, can be switched to within the window.

4.4 - Pros and Cons of Orange and Weka


The pros and cons of each software package in comparison to the other are a useful topic to touch on
for the purposes of this project. Orange’s simple-to-use design is notable compared to Weka. Weka
itself is more complex by virtue of having a number of additional applications associated with its
usage. The visual representation of nodes in Orange’s interface is more colourful than Weka’s, with
Orange featuring colour-coded categories which can make it easier for the end user to know at a
glance what inputs and outputs to expect. Weka opts for a more complicated categorisation of its
nodes by comparison.

The all-in-one design of Orange is useful in that all the processes being undertaken are clearly
visible on one screen, making usage for a beginner possibly easier than with the alternatives. Weka
has more functionality, but it happens to be spread across more applications and screens than
Orange’s.

The linking process between nodes in Orange is very intuitive, with a simple click-and-drag system
at the centre of how the various components of a data mining workflow are connected. Unlike Weka,
Orange clearly shows what data is being transferred along each connection, which is incredibly
useful for realising when data needs to be interpreted or converted before use.

Weka appears to be a fully equipped data mining tool without any further additions, whereas
Orange’s add-on system is much more heavily used than Weka’s. On the other hand, this means that
Orange can potentially be tailored for purpose, with less unused functionality than Weka.

Orange contains a number of visualisation tools, such as graphs and plots, more of which can be
imported via add-ons. Numerically, however, Weka seems to have the advantage in terms of
visualisation of data and analysis of results after the fact. Weka’s confusion matrix, for example, has
a greater amount of content than Orange’s counterpart.

5 - Progress and Challenges


Progress made so far in the project has been in keeping with the time expectations set at the
beginning of the year. Several datasets have been found which will ultimately be used and
interpreted by the final system. This has been the result of messaging several governmental
organisations and filing Freedom of Information requests.

Orange and Weka were the software chosen for evaluation. By browsing video tutorials and online
resources, progress has been made towards creating a pair of neural networks which can then be
selected between. This has been easier with Orange than with Weka, as Weka is a much less reliable
piece of software compared to Orange’s newer and better-supported product.

The project has encountered few difficulties. One notable problem was the lack of research and
papers on applying the technology to real-life use cases, at least in the field of image recognition.
This rendered the literature review somewhat difficult to complete, as it almost entirely consists of
a “gap analysis” and mentions of inspiration from these somewhat-similar papers.

Additionally, the datasets procured are generally out of date, as many datasets held by the DVLA and
other bodies require several thousands of pounds to purchase.

Windows itself as an operating system has posed several problems in the process of executing this
project. MATLAB’s .mat file extension is also identified by Windows as a Microsoft Access shortcut.
This was an issue when attempting to open and evaluate classification information related to the
training images: Microsoft Access attempts to open the data, cannot do so, then proceeds to show a
new Access project. This was rectified by installing MATLAB.

As of now, there have not been many notable deviations from the original plan in terms of changes
of direction. However, doubts about the viability of Weka, given its age relative to newer
technologies, have caused the balance of prototype development to shift in favour of Orange.

5.1 - Orange General Testing

Figure 2 - Raw car images sample

To test Orange on the set of all car images, the first step was to test how Orange would handle so
much data. To do so, a simple hierarchical clustering algorithm was implemented to check the utility
of the software. This was an effort to benchmark the performance of Orange as a program before
proceeding to create the classifiers required.

The main issue perceived in using Orange was that the program did not save progress, and every
time a project was imported with the “Import Images” widget, the entire workflow would be
recalculated. This was an issue because, in later stages of the project, when the entire set of
16,185 images was being imported and computed, this could take up to around 18 hours.

Orange calculates such embeddings by sending the relevant data to its servers, which generate the
image embeddings remotely. Similarly, the models are trained remotely, not on the host PC. This
was an issue, as the PC being used for this project was a top-grade consumer computer; by running
the computations locally, the time to complete these embeddings and the training on them could
have been drastically shorter by cutting out communication time.

Figure 3 - Hierarchical clustering test workflow

The hierarchical algorithm completed without much issue, yielding some useful insights into how
similar and how varied the car images are. That said, the clustering accuracy was low, given that the
algorithm in question was not properly specialised for clustering such varied images. The
differences in lighting and positioning, and the subtle differences between types of car from the
same manufacturer, all cause the accuracy to suffer.

It was around this point that it became clear that the source data was varied enough that, if the
resultant models achieved high accuracy, they could be used under a wide variety of environmental
conditions and variations. This hints that the models developed would bear significant real-life
usefulness as opposed to only theoretical accuracy.

5.2 - Preparing Data for Orange Classification


Moving on from the hierarchical clustering, the information gathered from its outcome being
significant, the methodology for creating the final models was the next priority. The first issue was
the current Orange workflow and the structuring of the source data. The classifier algorithms, unlike
the clustering algorithms, required a major shift in the layout of the input data, which led to the
source data having to be substantially restructured.

Orange derives the classes a classifier is to be trained on from the file structure. This contrasts with
other programs such as Weka, which often use configuration files to do so.

In this case, the 16,185 unsorted images within the dataset needed to be sorted into a file structure
containing 196 child folders, one for each class. From there, the entire root folder can be selected
and Orange can determine which classes to train the neural network on and which predictions are
correct. This presented an issue, as the source images were not pre-sorted upon download.

To do this, annotations mapping each image to its class were required. From this data it would be
possible to re-sort the source data, image by image, into the appropriate directory. The issue was
that this data was entirely within MATLAB files, which is not an appropriate format for reading and
reorganising many thousands of images conveniently.

5.2.1 - Formatting MATLAB to CSV
At this point the MATLAB format needed to be converted into a more code-friendly format. This was
relatively simple, requiring the annotations to be opened within MATLAB and then saved as .csv
files. This meant the files could be opened within Excel for convenient viewing and imported for
programmatic use.

However, the way the MATLAB annotations converted to an Excel-readable format was neither
perfect nor suitable for iterating through in a logical manner. Therefore, it was necessary to perform
some manipulation of the table to sort the information into a more optimal format. Additionally, the
class names themselves were stored separately from the actual pictures they related to; instead, a
second file assigned each picture a class number. For example, “000001.jpg” is associated with class
“1” rather than directly with the name of the class itself.
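As an aside, the same conversion could also be scripted from Python, since scipy can read .mat files
directly. Below is a hedged sketch of this alternative; the file and field names (“cars_annos.mat”,
“annotations”, “relative_im_path”, “class”) are hypothetical stand-ins for whatever structure the
dataset’s annotation file actually uses.

```python
# Hedged sketch: read a MATLAB annotation file and write a filename/class CSV
# for the later sorting scripts. "cars_annos.mat", "annotations",
# "relative_im_path" and "class" are hypothetical names.
import csv
import scipy.io

mat = scipy.io.loadmat("cars_annos.mat")
annos = mat["annotations"][0]            # 1xN MATLAB struct array

with open("annotations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "class_number"])  # headers added up front
    for a in annos:
        writer.writerow([str(a["relative_im_path"][0]),
                         int(a["class"][0][0])])
```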

5.2.2 - Restructuring CSV


This needed to be addressed in order to sort the source images into their constituent subfolders
correctly. The first issue was that the car images and their numerical class numbers did not have
proper headings for the Python-based sorting algorithm to use. This was easily rectified by creating
appropriate headers, as can be seen in Figure 4.

Further, the file containing the string names of the classes was structured such that a single row in
the Excel file contained all the class names, cell by cell; an example can be seen in Figure 5. The
main concern was that this would make iterating through these names, once again, more difficult
than it needed to be. Transposing the data to a single-column format, similar to how Figure 4 is
structured, made the required sorting algorithm easier to implement (Figure 6).

Figure 4 - File Name to class number CSV

Figure 5 - Unchanged class name spreadsheet sample

Figure 6 - Translated class name spreadsheet
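A minimal sketch, assuming pandas, of the transposition described above; the filenames
(“class_names.csv”, “class_names_column.csv”) are hypothetical and carry over to the later sketches.

```python
# Turn the single-row class-name spreadsheet into a single-column CSV.
import pandas as pd

row = pd.read_csv("class_names.csv", header=None)  # one row, one class per cell
row.T.to_csv("class_names_column.csv", header=["class_name"], index=False)
```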

5.2.3 - Removing leading zeros


The next issue was that the car image filenames contained varying numbers of leading zeros, which
would make iterating through the images one by one systematically difficult. Using Python, several
custom scripts were created for this project to facilitate the reorganisation of these images into a
usable format. The first of these strips any leading zeros from the filenames, giving the ability to
iterate easily through over 16,000 images without having to account for differing numbers of leading
zeros. The code for this can be seen below.

Figure 7 - Code to remove leading zeros
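Since Figure 7 is reproduced only as a screenshot, below is a minimal sketch of what such a script
might look like; “car_ims” is a hypothetical directory name.

```python
# Strip leading zeros from every filename in the image directory, so that
# "000123.jpg" becomes "123.jpg" and the images can be iterated numerically.
import os

directory = "car_ims"
for name in os.listdir(directory):
    stem, ext = os.path.splitext(name)
    stripped = stem.lstrip("0") or "0"   # keep a single "0" if all zeros
    if stripped != stem:
        os.rename(os.path.join(directory, name),
                  os.path.join(directory, stripped + ext))
```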

Figure 8 - Unaltered car image names

Figure 9 - Altered car image names

This performed well for such a simple solution, easily cutting out the irregular numbers of leading
zeros which could otherwise prove problematic for a simple iterative program sorting these
thousands of pictures into the subfolders necessary for the Orange classification process.

5.2.4 - Creating subfolders


The method chosen for creating the subfolders was, once again, Python. The goal of this algorithm
is to iterate through the CSV containing the file names and class numbers one row at a time, sorting
each named picture into a subfolder corresponding to the name of its class. This is achieved by
tracking the point at which the class number changes from the previous row. At that point the
algorithm looks up, in the CSV containing the class names, which subfolder has not yet been created
within the root directory. The program then creates this subfolder, continuing to iterate through the
photos row by row until another change in the class number is perceived. This prompts another
subfolder to be created within the root directory, and the process repeats for every car image.

However, there were some issues with coding this iterative algorithm. As seen below, there were
several problems regarding errors in the dataflow. One of these concerned the Python method for
moving a file between folders: the OS module has to be used, along with a specific way of
designating the directory, as shown below. This caused a slight slowdown in development while it
was worked through and resolved.

Figure 11 - Failing code to divide car images into subfolders by class

Furthermore, several other issues presented themselves. As visible below, the console log shows
that the code fails to recognise when the next directory should be created - i.e. when all images of
the previous class have been sorted. The feedback message for the first directory being created is
received, but the code then iterates through only the first 196 images, believing that each picture
should be its own class. This was rectified by fixing the lookup section of the code, where the
image’s class is checked before the program assumes it should cycle to the next class. This
prevented the mismatch of false classification and the creation of only one subfolder, as seen.

Figure 12 - Code to divide car images into subfolders by class failing to create multiple subfolders

At this point the algorithm was corrected further, as seen below. In addition, as shown in the figures,
the filenames were corrected to remove spaces and insert underscores. This assists with creating
the necessary subfolders, as Windows does not reliably accept spaces within directory names. There
was another small issue in the naming convention of some of the classes: certain classes of car
included forward slashes and backslashes in their names. This posed difficulties for composing the
string name for the folders correctly, which needed to be corrected manually in several areas of the
naming CSV file.

Figure 13 - Working code to divide car images into subfolders by class

Figure 14 is a snapshot of the final code used to sort the photos correctly into the 196 separate
classes. By fixing the way the addressing string was composed, it was possible to rectify the issue
of only one directory being made in the previous version of the code. The results of this code can be
seen in Figure 15: the algorithm successfully sorted many thousands of images into unique
subfolders, according to the categorisation held in two CSV files.

Figure 14 - Final code to divide car images into subfolders by class with output log
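Since Figures 13 and 14 are reproduced only as screenshots, below is a minimal sketch of the sorting
step they implement. The file and column names (“annotations.csv”, “class_names_column.csv”,
“filename”, “class_number”, “class_name”) are hypothetical and carry over from the earlier
sketches; for simplicity it creates folders on demand rather than tracking class changes, and assumes
the CSV filenames match the files on disk.

```python
# Hedged sketch: move each image into a subfolder named after its class.
import csv
import os
import shutil

root = "car_ims"  # hypothetical root directory of unsorted images

# Load the class-number -> class-name lookup (one name per row, 1-indexed).
with open("class_names_column.csv", newline="") as f:
    names = [row["class_name"].replace(" ", "_") for row in csv.DictReader(f)]

# Iterate through filename/class pairs, creating each class subfolder on
# first use and moving the image into it.
with open("annotations.csv", newline="") as f:
    for row in csv.DictReader(f):
        class_name = names[int(row["class_number"]) - 1]
        target = os.path.join(root, class_name)
        os.makedirs(target, exist_ok=True)
        src = os.path.join(root, os.path.basename(row["filename"]))
        shutil.move(src, target)
```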

Figure 15 - Car images divided into subfolders by class

5.3 - Preparing ARFF for Weka classification


As no ARFF file existed for this dataset, it was necessary to compile a unique ARFF file to work with
Weka’s systems so that the data could be imported correctly. This required some research, such as
how to compose the attributes required to specify how the class names related to the image names,
as well as the classes themselves.

As can be seen below, this required a very particular layout within the file. The “relation”
declaration is required within the ARFF for the configuration to load correctly; it is visible at the top
as “car_ims_master”. Secondly, the “filename” attribute needed to be declared as a string, so that
Weka knew which files within the chosen directory to ingest and had unique identifiers to relate to
the assigned classes.

The “data” section is arguably the most important part of the ARFF file. It is where the previously
defined attributes are related to each other, and from which the learning algorithms can later
calculate their accuracy. As can be seen, the data has to be structured in a very specific way. First,
the full filename and extension is given, so that Weka knows which file corresponds to which
associated data. After this, the classification of that data is added.

Figure 16 - Custom ARFF for the car images
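Since Figure 16 is reproduced only as an image, the following is a minimal sketch of the layout
described above, using a hypothetical three-class excerpt of the full 196-class list:

```
@relation car_ims_master

@attribute filename string
@attribute class {AM_General_Hummer_SUV_2000,Acura_RL_Sedan_2012,Acura_TL_Sedan_2012}

@data
1.jpg,AM_General_Hummer_SUV_2000
2.jpg,Acura_RL_Sedan_2012
```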

5.3.1 - ARFF formatting issues


A small issue with the ARFF format came from the class attribute, which was composed from the
original list of classes in the MATLAB files. The source wrapped each class in single quotes, which,
when inserted into the long array of classes in the ARFF file, refused to work. By removing this
wrapping, the classes were accepted into Weka.

Furthermore, there were several instances where the naming conventions of the Acura classes, once
formatted as seen within the ARFF file, caused conflicting identical class names. This needed to be
rectified by singling out the problem classes and renaming them. These renaming changes were then
migrated across to all the settings in Weka that needed changing.

Similarly, Weka did not accept classes with erroneous characters, so characters such as the slash in
“C/V” needed to be replaced with “CV”, along with several other instances of special characters
used within the class names.
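A minimal sketch of the class-name clean-up described above might look like the following; the set
of offending characters is taken from the issues described in this section.

```python
# Hedged sketch: strip the single-quote wrapping from a raw class name and
# replace the characters that Weka (and Windows directory names) rejected.
def sanitise(raw_name: str) -> str:
    name = raw_name.strip("'")                      # remove the quote wrapping
    name = name.replace("/", "").replace("\\", "")  # e.g. "C/V" -> "CV"
    return name.replace(" ", "_")                   # spaces -> underscores

print(sanitise("'Acura Integra Type R 2001'"))  # Acura_Integra_Type_R_2001
```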

Figure 17 - Code for constructing custom ARFF file

The program for extracting, formatting and renaming the data to construct the ARFF file is visible
in Figure 17. Figure 18 shows the resulting file, with the classes correctly paired with the necessary
image names.

Figure 18 - Final ARFF file, having converted JPG to PNG

5.3.2 - Colour space issues


Weka does not natively handle image data within its workflow. To allow this, the ImageFilters plugin
had to be installed via Weka’s package manager. Several issues were then encountered while
initially inputting the data into Weka for filters to be applied.

Figure 19 - Importing the Image Filters package into Weka

One of the major issues encountered was that, when the ARFF file was loaded into the program, an
error regarding incorrect colour spaces and rasters would appear. This caused many problems, as it
is not a Weka-native error; the reports of similar errors online related to specific errors in Python.
As changing the source code of Weka was definitely outside the scope of the project, it was clear
that a notable change to the source data was needed to attempt to solve this error.

After consulting with experts in Weka, it was decided that the source data should be converted to
another colour space. As an elegant solution, a Python script was written to cycle through all of the
pictures and convert them from JPEG to PNG format.

Figure 20 - Algorithm to convert car images from JPG to PNG
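Figure 20 is reproduced only as an image, so below is a minimal sketch of such a conversion using
Pillow; “car_ims” and “car_ims_png” are hypothetical directory names.

```python
# Convert every JPEG in the source directory to PNG. Converting to RGB
# normalises the colour space before saving.
import os
from PIL import Image

src, dst = "car_ims", "car_ims_png"
os.makedirs(dst, exist_ok=True)
for name in os.listdir(src):
    if name.lower().endswith((".jpg", ".jpeg")):
        img = Image.open(os.path.join(src, name)).convert("RGB")
        img.save(os.path.join(dst, os.path.splitext(name)[0] + ".png"), "PNG")
```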

This worked to solve the error, allowing the data to be imported into Weka, as seen in Figure 21.
The filename and class attributes are clearly visible, along with the 16,185 instances of car pictures.
From here, it was possible to move ahead and begin running filters to extract data for Weka to
classify upon.

Figure 21 - Weka interface after loading ARFF, with filename and class visible

To navigate to the filters which can be run on images, the “Choose” button must be pressed and the
desired filter selected from the list. As the images are not in a standard format that Weka can
process directly, only the special filters designed for this purpose from the Image Filters plugin can
be used. This can be seen below: there is a large list of filters under unsupervised, specifically the
instance subset of the unsupervised algorithms.

Figure 22 - Selecting the image filters from the interface

As can be seen in Figure 23, each of the 195 distinct classes has been successfully imported. The
figure also shows that a suitable filter has been chosen, along with the directory set to that of the
unsorted PNG images folder.

Figure 23 - Visual graph of the 196 classes

5.3.3 - Removing attributes and expanding memory allocation


Below is an example of applying a filter to the data and the resultant output. The final step in
preprocessing the data is to select and remove the “filename” string attribute we started with,
because it interferes with the classification of the data: many classifiers cannot utilise string
attributes. A new issue was encountered during this step: running Weka consistently exhausted the
memory allocated to Java, crashing the program immediately. The fix for this problem was to
adjust the Java environment variables to expand the maximum memory allocation to more apt
amounts.[18]

Figure 24 - The additional image features extracted from the filter used

5.4 - Running Orange Classification


Below are the workflows created in Orange to fulfil the needs of the project. In terms of time, it was
more efficient overall to initially train the models on all the data at once, though this posed many
challenges. For clarification, as can be seen below, the Neural Network and other models are
trained on images which have been imported and embedded with features that can be trained on.
Once a score is output, the accuracy of each model over the dataset can be determined. Additionally,
the blue widgets to the right of Test & Score allow for further analysis. For the purposes of this
report, these were used for debugging, as figures from these outputs would be difficult to insert and
hard to interpret.

Figure 25 - First workflow designed and used

There were several issues regarding the workflow in Figure 25. Primarily, by training all the models
at once, the system required a huge amount of computing resources and time to complete. To this
end, the workflow was split into subsections, one of which is visible in Figure 26. Additionally, the
source data was cut down to fit within a 16GB RAM capability. This posed a few problems: having
trained the models on the whole dataset, Orange cannot Test & Score the models after the fact, for
example when loading a model for predictions. Therefore, for the purposes of testing the efficiency
of the algorithms, it was only possible to evaluate the models against each other during the training
workflow. This means that the actual accuracy of the fully trained models could not be documented;
however, models trained on a quarter of the data were created to compare the relative accuracies
and capabilities of each type of model trained and used.

Figure 26 - Cut down training/testing workflow for performance improvement

Prior to restructuring and removing the unnecessary parts of the workflow to arrive at Figure 26,
the full workflow used for training and debugging the original models can be seen below. The data
flowing through the workflow was saved at regular intervals during development, in case it was
required for backend comparisons and processing. Ultimately, there was simply too much data
saved from these stages to process into meaningful statistics not already included in this project.

Figure 27 - Full diagnostic and model saving workflow

Here we move on to the basic building block from which the actual end-user predictions on user
images are implemented. By loading a model and passing the embedded image data into Predictions,
each class is assigned the proportion of the time the image is classified as that class. Additionally,
the class which the model is most confident the image belongs to is output as the primary predicted
class.

Figure 28 - Prediction mechanism using trained models on inputted car image data

Figure 29 is the extrapolation of this, implementing the technology in a manner the end user may
find useful. Useful information is loaded into the workflow, such as miles per gallon, MSRP and other
statistics. From there, the data is merged and concatenated into the table output from the
prediction, which includes the predicted class. This is done by cross-checking the predicted class
against the names of the cars in the imported dataset. Finally, the table and data are stripped down,
removing all the features that were used for the feature embedding and any other data which is not
useful.

Figure 29 - Functional workflow including external dataset association
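A hedged pandas sketch of the merge-and-strip step the Orange workflow performs: look up the
predicted class in an external dataset and keep only the useful columns. The file and column names
(“car_specs.csv”, “class_name”, “MSRP”, “MPG”) are hypothetical.

```python
# Look up the model's predicted class in a per-class statistics table.
import pandas as pd

specs = pd.read_csv("car_specs.csv")      # one row of statistics per class
predicted = "AM_General_Hummer_SUV_2000"  # class output by the model

row = specs.loc[specs["class_name"] == predicted, ["class_name", "MSRP", "MPG"]]
print(row.to_string(index=False))
```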

The formation of the workflow took a lot of trial and error to complete. The method of merging the
data took a good amount of manual work, using Excel to concatenate and strip the data down to a
field which could be compared exactly to the predicted class. Additionally, making sure the useful
data concatenated to the table wasn’t stripped out was a challenge, as there was little way to discern
how to do this other than trying all the various options available within the widget. Furthermore, the
output needs to be filtered to include only the information for the class which was predicted; shown
below is one of the steps needed for such processing.

Figure 30 - Tool to select only the data for the image entered into the system

5.5 - Comparison of Models


This project is designed to evaluate a broad array of different models, to best improve consistent
prediction capabilities and therefore the end user’s experience with the system. As previously
stated, it was difficult to train the models on more than a quarter of the entire collection of car
photos at a time. Therefore, the statistics in Figure 31 for each model are not as high as they would
be with a stronger computational base on which to train them. However, this comparison shows
which type of model is better at classification than the others. Furthermore, it may reveal which
classes will have problems being classified.

The inputs and outputs of the system are uniform and logical. The input in its rawest form is the
>16,000 unsorted car images inside one folder. This is first manipulated by a piece of
custom-written Python code, which strips the unnecessary leading zeros from each of the images’
names. Once this has completed, another piece of custom code loads two Excel files, scanning
through the one which holds the filenames of each image. The class number associated with each
image is then cross-referenced against the other Excel file, which holds the association of class
numbers to class names.

The code then creates a subfolder, with a name extracted from the aforementioned Excel file. The
name within the Excel file for the given class is formatted for use in Windows folder names, for
example by replacing spaces with underscores. Once the folder is created, the code continues
iterating through the list of filenames and their associated class numbers. All images are sorted into
the folder whose title matches their class name.

Once sorted into their respective subfolders, the whole root folder is passed into Orange, which uses
the subfolder directories to denote classes within its framework. This is what the various models are
trained and tested upon, after the images are individually passed through an embedding widget
which extracts additional features for the classifier to use. Given the processing power available,
dividing the full set of images into quarters was the best way to give accurate testing of the
classifiers. The output from this process is four models (Logistic Regression, Support Vector
Machine, Neural Network and Naive Bayes) for each quarter of the data.

These models can then be evaluated against each other, using the confusion matrices and other
metrics noted below, to decide which model is best for real-world usage. To run predictions, the
image to be classified is run through the embedding component to extract features to classify
against. The model to classify with is loaded into the Predictions component; the output from this
portion of the process is the class which the model believes the image belongs to (or each image, if
multiple are loaded).

Once this class is decided, the class name is passed forward to a component which merges that
information with the loaded dataset associated with the class. This is then further input to a
component which removes all the unnecessary cells that were merged in from the loaded dataset.
The result is a single row of data for each car image input to the system as a whole. This row
includes not only the classification of the image but also any useful information pertaining to that
class from the dataset.

As for the metrics noted in the tables within this section of the paper, the abbreviations are as
follows. AUC is the Area under ROC, the area under the receiver-operating curve. CA is
Classification Accuracy, the proportion of correctly classified examples. Precision is the proportion
of predicted positives that are true positives, measuring false positives. Recall is the proportion of
true positives identified amongst all positive instances in the data. F1 is a weighted harmonic mean
of precision and recall.
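Expressed in terms of true positives (TP), false positives (FP), true negatives (TN) and false
negatives (FN), these standard definitions correspond to:

```latex
\mathrm{CA} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
\mathrm{Precision} = \frac{TP}{TP + FP}, \quad
\mathrm{Recall} = \frac{TP}{TP + FN}, \quad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```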

In particular, the confusion matrix is the analytical metric used most within this research. It shows
the percentage of misclassifications for any given class across all classes. This is useful for deducing
which classes are most prone to being misclassified and which classes cause that misclassification.

5.5.1 - Results of the first quarter of the data


The table included within this section shows a useful analysis of each model’s aptitude to classify.
An important field to consider is “precision”: the proportion of the images a model labels as a class
that actually belong to that class. It is important to remember that the prediction chooses between
196 classes, many of which contain visually similar models. This means that, with additional work
on the classes chosen in future, this accuracy could increase substantially.

As seen here, the most accurate model is Logistic Regression. Conversely, Naive Bayes is the least
accurate of the models. The difference in accuracy between these models is around 20 percentage
points, as derived from the metrics of CA (Classification Accuracy), Precision and Recall below. The
maximum accuracy of any model is that of Logistic Regression, which for the first quarter of the data
is around 53%. In relative terms, the Naive Bayes solution is therefore nearly 35% worse than
Logistic Regression.

Figure 31 - Accuracy and metrics table for the first quarter of the car images

Figure 32 - Confusion matrix of neural network model for the first quarter of the car images

Figure 33 - Confusion matrix of support vector machine model for the first quarter of the car images

Figure 34 - Confusion matrix of logistic regression model for the first quarter of the car images

There are some very interesting findings to be extracted from the confusion matrices in Figures 32,
33 and 34, ordered as follows: Neural Network, Support Vector Machine, Logistic Regression. It was
not possible within Orange to calculate a confusion matrix for Naive Bayes. Across all results, it is
clear that models from the same manufacturer are generally difficult to distinguish from each other.
Conversely, cars which are distinctive, with no other cars of the same brand among the model’s
trained classes, approach total certainty of prediction.

A great example of this is the AM General Hummer, which has a very distinctive shape with no cars
of similar make within the model’s classes. Therefore, as visible in the above tables, its accuracy is
near 100%, with only 1.1% uncertainty. Conversely, the Acura RL Sedan 2012, Acura TL Sedan 2012
and Acura TSX Sedan 2012 show very prominent scattering within the confusion matrix. Ideally, the
names of these would be more distinguishable in the tables; however, Orange’s scaling limited the
names as seen.

Specifically, the fifth Acura model down the table shows how similar models within the classifier can
cause issues. As can be seen, the third Acura model listed confuses the classification of the fifth,
representing nearly 20% of the misclassifications of that model. Furthermore, the second Acura
model accounts for over 6% of the misclassifications, and the sixth represents nearly 5%. This totals
around 30% of the otherwise very scattered misclassifications.

5.5.2 - Results of the second quarter of the data

Figure 35 - Accuracy and metrics table for the second quarter of the car images

The findings from the first quarter are backed up by the results from the second quarter of the data.
The best solution is clearly Logistic Regression, the worst being Naive Bayes. The matrices below
are in the order given in Figure 35.

Figure 36 - Confusion matrix of support vector machine model for the second quarter of the car images

Figure 37 - Confusion matrix of neural network model for the second quarter of the car images

Figure 38 - Confusion matrix of naive bayes model for the second quarter of the car images

Figure 39 - Confusion matrix of logistic regression model for the second quarter of the car images

Similarly to the first quarter of usable data, this is a small slice of the around 50 classes trained on
each quarter of the data. Much as we saw from the results of the first quarter, there is clearly no
doubt that the models can correctly classify the car’s model and make regardless of the weather
conditions, lighting, colour and angle of the vehicle. There are some notable cases within the
matrices where models of vehicle that appear very similar to other models cause misclassification.
However, so far we have yet to see this cause more misclassification into a certain incorrect class
than into the correct class. The most prominent example of this is in the logistic regression matrix
(Figure 39), where the first Chevrolet in the table is misclassified as the second Cadillac almost half
as often as it is correctly classified.

5.5.3 - Results of the third quarter of the data

Figure 40 - Accuracy and metrics table for the third quarter of the car images

The first half of the data showed that Logistic Regression is the most accurate of the models, so for
the final half its matrix will be the primary focus. In this third quarter, of the top two most accurate
models, Logistic Regression and the Neural Network, Logistic Regression yields a model ~5% more
accurate than the Neural Network. The gap to the Naive Bayes model is much greater: Logistic
Regression is ~23% more accurate, classifying correctly nearly 50% more often than Naive Bayes.

Figure 41 - Confusion matrix of logistic regression model for the third quarter of the car images

The confusion matrix in Figure 41 is a great example of how more distinctive car shapes allow easier,
more accurate classification than more generic silhouettes. The Fiat 500 and most Ferrari models are
very distinct from the majority of the other cars classified within this third quarter of the car image
data. Even with three Ferrari types within the model, the scattering that would have occurred with a
less distinctive model did not occur.

Ordinarily, one or several percent of the total misclassification instances would be scattered across
a number of classes. With a very distinctive visual appearance for a type of car, this does not occur
nearly as much. An example within Figure 41 is the Fisker, which has 5.7% misclassification even
within the sample of six classes chosen for this confusion matrix, this error being spread across just
three of them.

5.5.4 - Results of the fourth quarter of the data

Figure 42 - Accuracy and metrics table for the fourth quarter of the car images

The fourth quarter of data shows a very large anomaly compared to the other three quarters. The
Naive Bayes model has not handled this quarter well, showing a drop of 40 percentage points in
accuracy compared to the third quarter, to under 10%. This is an outlier from the other three
quarters of the car images, which shows the value of analysing the data in sections rather than as a
whole.

Figure 43 - Confusion matrix of logistic regression model for the fourth quarter of the car images

It is clear from the confusion matrices in this section that there are trends within the data. If a
car looks very similar to other types of car, a large share of the error flows towards those classes,
reducing the total proportion of correct classifications. Another trend is that having a large number
of very varied classes, such as the ~50 classes per quarter, results in a notable cumulative error:
in many cases a few percent of the total errors are dispersed across a number of classes, so the
error grows with the number of classes in the model. For illustration, if each of 49 incorrect
classes attracted even 0.5% of a class's instances, nearly a quarter of that class's classifications
would be lost to this scatter.

5.6 - Weka Classification


At this point, classification was run in Weka upon the extracted data described in section 5.3. It
was clear that there would be no intuitive method to join data from external sources, such as miles
per gallon, onto Weka's output. As Orange already fulfilled this requirement of the project, Weka was
developed no further beyond concluding its evaluation in this report.

I believe the process depicted within this research can be of use to academics attempting to use
Weka for image analysis. The methods explored here can be used to counter errors that may arise from
using Weka in this way, saving time that would otherwise be spent correcting such errors.

6 - Conclusion
The research concludes that it is possible to create a solution within Orange which allows a car
image to be analysed, a prediction of the make and model of that car to be generated with reasonable
accuracy, and this prediction to then be associated with external datasets. In particular,
a Logistic Regression solution is the most apt for real-life usage, given its consistently highest accuracy across the four quarters.

This external data may influence product purchasing, such as the MSRP and MPG of the predicted car
type. For end users, the process takes only a few seconds and performs with consistent accuracy.

Figure 44 below shows real example output of the working Orange system. Far more data is appended
than is displayed; presenting it all in a figure would render it unreadable.

Figure 44 - Example system output

This system will legitimately help many people in the process of purchasing vehicles in person from
less-than-reputable sources. Running a picture through this Python-based workflow and having the
system return useful information about that car within a few seconds is a tangible benefit.

The development of this solution required a large amount of independent research, including learning
and understanding two separate data mining frameworks, Weka and Orange. Mastering the niches and
specifics needed to properly utilise both systems required extensive testing and iterative correction
of solutions.

The flow of data through the system is as follows. More than 16,000 unsorted car images form the
source data, which is first manipulated by a piece of custom-written Python code that removes the
leading zeros from each image's filename. When completed, a second custom program loads two Excel
files, iterating across the first spreadsheet, which contains the filenames of each image. The class
number related to each image is then looked up in the other Excel file, which contains the mapping of
class numbers to class names.
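
As an illustration, a minimal sketch of this renaming step is given below. The folder name and the purely numeric filenames are assumptions for the example, not the exact code used within this project.

import os

IMAGE_DIR = "car_ims"  # hypothetical folder holding the unsorted car images

for name in os.listdir(IMAGE_DIR):
    stem, ext = os.path.splitext(name)
    stripped = stem.lstrip("0") or "0"  # never produce an empty filename
    if stripped != stem:
        os.rename(os.path.join(IMAGE_DIR, name),
                  os.path.join(IMAGE_DIR, stripped + ext))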

This program creates a subfolder for each class name found within the relationships Excel file. Each
class name is converted to comply with the Windows folder naming convention, for example by replacing
spaces with underscores. The car images are then sorted into the folder whose title matches their
class name.
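
A hedged sketch of this sorting step follows, assuming pandas is used to read the spreadsheets; the file names and column names are hypothetical stand-ins for the real Excel files.

import os
import shutil
import pandas as pd

# Hypothetical spreadsheet names and column layouts.
images = pd.read_excel("image_classes.xlsx")   # columns: filename, class_number
classes = pd.read_excel("class_names.xlsx")    # columns: class_number, class_name

# Map each class number to a Windows-safe folder name.
folder_of = {row.class_number: row.class_name.replace(" ", "_")
             for row in classes.itertuples()}

# Create one subfolder per class and move each image into its class folder.
for row in images.itertuples():
    target = os.path.join("sorted", folder_of[row.class_number])
    os.makedirs(target, exist_ok=True)
    shutil.move(os.path.join("car_ims", row.filename), target)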

Once sorted into their respective subdirectories, the root folder is loaded into Orange, which uses
the subfolder directories to separate the contained images into classes within its framework. The
images are passed one at a time through an embedding widget, which filters them and extracts
additional features, and the various models are then trained and tested on the result.
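
For completeness, the equivalent of this stage in Orange's scripting layer might be sketched as below. The filename is hypothetical, the embeddings are assumed to have already been exported from the Image Embedding widget to a tab-separated file with the car class as the target column, and exact call signatures vary between Orange versions.

import Orange

data = Orange.data.Table("embeddings.tab")  # hypothetical exported embeddings

learners = [Orange.classification.LogisticRegressionLearner(),
            Orange.classification.NaiveBayesLearner()]

# 10-fold cross-validation; SVM and neural network learners are analogous.
results = Orange.evaluation.CrossValidation(data, learners, k=10)
print("CA: ", Orange.evaluation.CA(results))
print("AUC:", Orange.evaluation.AUC(results))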

Orange's embedding process requires submitting the images to Orange's servers, so dividing the full
set of images into quarters was the best way to obtain reliable testing of the classifiers.
Otherwise, submitting more than 16,000 images to Orange's servers at once would have caused the same
issues encountered during the preliminary stages of using the framework.
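
A minimal sketch of such a four-way split is shown below; the function name and the consecutive chunking are illustrative assumptions.

import math

def quarters(items):
    """Split a sequence into four near-equal consecutive chunks."""
    size = math.ceil(len(items) / 4)
    return [items[i:i + size] for i in range(0, len(items), size)]

# e.g. quarters(list(range(10))) -> [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]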

Ultimately, four models (Logistic Regression, Support Vector Machine, Neural Network and Naive
Bayes) were created for each quarter of the data, giving 16 models in total. Each was evaluated using
confusion matrices and other metrics to determine the most effective model for real-world end-user
usage, and each is analysed individually within section 5.5.

Overall, the ranking of the four model archetypes was: Logistic Regression, Support Vector Machine,
Neural Network and Naive Bayes. Accuracy in some cases approached 99%. However, the main issue
identified within section 5.5 was that visually similar cars, most notably those of the same make or
general model line, caused nearly all of the error.

This is understandable, as even human experts may sometimes be unable to distinguish differing models
of car from the same manufacturer. Analysis of the error displayed by the confusion matrices shows
that any given class is classified correctly at least twice as often as it is assigned to any single
incorrect class.
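
This property can be checked mechanically from any confusion matrix. Below is a hedged sketch assuming the matrix is held as a NumPy array with true classes as rows; the function name and the factor parameter are illustrative.

import numpy as np

def diagonal_dominates(cm, factor=2):
    """Check that each class is classified correctly at least `factor`
    times as often as it is assigned to any single incorrect class."""
    cm = np.asarray(cm)
    for i in range(len(cm)):
        errors = np.delete(cm[i], i)  # off-diagonal errors for true class i
        if errors.size and cm[i, i] < factor * errors.max():
            return False
    return True

# e.g. diagonal_dominates([[90, 5, 5], [10, 80, 10], [4, 3, 93]]) -> True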

To run predictions, the image to be classified is run through the embedding component to extract
features to classify against. The chosen model is loaded into the predictions component, and the
output of this stage is the class the model believes the image belongs to, or one class per image if
multiple images are loaded.
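
In scripting terms, this prediction stage might be sketched as follows, under the same assumptions as the earlier training sketch (hypothetical filenames and a version-dependent API).

import Orange

train = Orange.data.Table("embeddings.tab")      # hypothetical filenames
model = Orange.classification.LogisticRegressionLearner()(train)

new = Orange.data.Table("new_embeddings.tab")    # embedded unseen images
class_names = train.domain.class_var.values
for prediction in model(new):                    # one predicted class per image
    print(class_names[int(prediction)])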

Once this class is predicted, the class name is merged with the information in the external dataset
that pertains to that class. This is done by concatenating tables within Orange. Any unnecessary data
merged in from the dataset is subsequently removed so that only useful information is presented. The
output is one record in the resultant table for every car image that was input for prediction.
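
Outside Orange, the same join can be illustrated with pandas; the DataFrames, column names and values below are hypothetical stand-ins for the prediction output and the external MSRP/MPG dataset.

import pandas as pd

# Hypothetical prediction output and external car data.
predictions = pd.DataFrame({"image": ["car_1.jpg"],
                            "class_name": ["BMW 3 Series Sedan 2012"]})
external = pd.DataFrame({"class_name": ["BMW 3 Series Sedan 2012"],
                         "MSRP": [39300], "MPG": [34]})

# Join on the class name, keeping one output record per input image.
report = predictions.merge(external, on="class_name", how="left")
print(report[["image", "class_name", "MSRP", "MPG"]])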

This research required several pieces of custom code to be written to interpret, preprocess and
order the data for ingestion by the machine learning frameworks. This demanded notable proficiency in
Python, executing systematic, sequential operations to ensure that the base data of unordered car
images could be turned into trained classifiers.

There are several areas in which the system could be further developed, such as translating the
Orange workflow into raw Python code for extended usability, adding other databases into the system,
or implementing the databases noted in the literature review into the workflow.

Appendix 1 - Provisional Project Contents
Contents
Abstract
1 - Introduction
1.1 - Aims
1.2 - Objectives
1.3 - Methodology

2 - Background
2.1 - General Analysis
2.2 - Motivation for undertaking the Project

3 - Literature Review
3.1 - Literature Review
3.1.1 - Dataset Collection
3.1.2 - Literature
3.2 - Techniques used by others
3.3 - Gathering of findings - never been done before

4 - Comparison of Orange and Weka
4.1 - Properties and abilities of Orange and Weka
4.2 - Current usage of Orange and Weka
4.2.1 - Orange
4.2.2 - Weka
4.3 - Visual interfaces
4.3.1 - Orange’s Visual Interface
4.3.2 - Weka’s Visual Interface
4.3.3 - Visual representation of results
4.5 - Pros and Cons of each software used

5 - Prototype Design

6 - Creation and Implementation of Solutions

7 - Testing

8 - Results and Recommendations

9 - Conclusion

Acknowledgements

References

Appendices

Appendix 2 - Revised Work Plan
