2019 - Improving The Classification of Tiny Images For Forensic Analysis

The thesis by Roba Jafar Alharbi, submitted to the Florida Institute of Technology, focuses on improving the classification of tiny images for forensic analysis, particularly in the context of digital forensics. It evaluates various classification techniques, including PCA, KNN, and CNN, using the CIFAR-10 dataset, and finds that CNN provides the highest accuracy at approximately 74.10%. The research highlights the importance of effective image classification in aiding cybercrime investigations by reducing the time needed to analyze large datasets.


Florida Institute of Technology

Scholarship Repository @ Florida Tech

Theses and Dissertations

12-2019

Improving the Classification of Tiny Images for Forensic Analysis


Roba Jafar Alharbi

Follow this and additional works at: https://repository.fit.edu/etd

Part of the Information Security Commons


Improving the Classification of Tiny
Images for Forensic Analysis

By
Roba Jafar Alharbi

A Thesis submitted to the College of Engineering and Science at


Florida Institute of Technology
in partial fulfillment of the requirements
for the degree of

Master of Science
in
Information Assurance and Cybersecurity

Melbourne, Florida
December 2019
We the undersigned committee hereby approve the attached Thesis

Improving the Classification of Tiny Images for Forensic Analysis

by
Roba Jafar Alharbi

_________________________________________________
William Allen, Ph.D.
Associate Professor
Computer Science
Committee Chair

_________________________________________________
Veton Këpuska, Ph.D.
Associate Professor
Electrical and Computer Engineering
Outside Committee Member

_________________________________________________
Marius Silaghi, Ph.D.
Associate Professor
Computer Science
Committee Member

_________________________________________________
Philip Bernhard, Ph.D.
Associate Professor and Department Head
Computer Engineering and Science
Abstract
Title: Improving the Classification of Tiny Images for Forensic Analysis

Author: Roba Jafar Alharbi

Advisor: William Allen, Ph.D.

Forensics can be defined as the approach used by governments and other organizations to detect malicious activity. Digital forensics has become an essential approach to cyber investigation, and image forensics is one of the most beneficial techniques used in digital forensics to help investigators in cybercrime cases. By applying digital forensics techniques, investigators can discover new evidence beyond what is already available on their systems.

This thesis focuses on identifying an image based on its contents, especially tiny images. We investigated ways to improve the performance of several data classification techniques: principal component analysis (PCA), K-nearest neighbors (KNN), and convolutional neural networks (CNN). To test these classification techniques, we used feature extraction to obtain the most useful features, which serve as inputs to the classifiers. We used the CIFAR-10 dataset, which contains 60,000 tiny 32 x 32 color images.

The three classification techniques were tested to identify the most accurate algorithm for classifying the tiny images of the CIFAR-10 dataset. Our experiments showed that the best results were achieved with the convolutional neural network (CNN), which reached approximately 74.10% accuracy, making it the best of the three classifiers evaluated in this research.

Table of Contents

Table of Contents .....................................................................................................v

List of Figures ....................................................................................................... viii

List of Tables ............................................................................................................x

Acknowledgements................................................................................................. xi

Dedication ............................................................................................................. xiii

Chapter 1 ..................................................................................................................1

1.1 Overview ....................................................................................................1

1.2 Motivation ..................................................................................................3

1.3 Research Goals ..........................................................................................5

1.4 Research Contribution .............................................................6

1.5 Research Organization .............................................................................7

Chapter 2 ..................................................................................................................9

2.1 Overview of Digital Forensic ....................................................................9

2.1.1 Data File Forensic ..........................................................................11

2.1.2 Network Forensic ...........................................................................12

2.1.3 Database Forensic ..........................................................................14

2.1.4 Web Forensic ..................................................................................15

2.2 CIFAR-10 Dataset ...................................................................16


2.3 Overview of Classification Techniques .................................................17

Chapter 3 ................................................................................................................22

3.1 Framework ..............................................................................................22

3.2 Experimental Structure ..........................................................................26

3.2.1 A Suitable Software .......................................................27

3.2.2 Dataset .............................................................................28

3.2.3 Data Preparation ............................................................30

3.3 Classification Techniques .......................................................................32

3.3.1 Principal Component Analysis (PCA)..........................................32

3.3.1.1 Overview of Principal Component Analysis (PCA) ................32

3.3.1.2 The Experiment with Principal Component Analysis (PCA) 33

3.3.1.3 The Results..................................................................................34

3.3.2 K- Nearest Neighbors (KNN) ........................................................42

3.3.2.1 Overview of K- Nearest Neighbors (KNN) ..............................42

3.3.2.2 The Experiment with K- Nearest Neighbors (KNN)...............43

3.3.2.3 The Results..................................................................................44

3.3.3 Convolution Neural Network (CNN)............................................50

3.3.3.1 Overview of Convolution Neural Network (CNN) ..................50

3.3.3.2 The Experiment with Convolution Neural Network (CNN) ..51

3.3.3.3 The Results..................................................................................54

Chapter 4 ................................................................................................................60

4.1 Analyzing Results .........................................................................................60

Chapter 5 ................................................................................................................66

5.1 Conclusion and Future Work .....................................................................66

Chapter 6 ................................................................................................................69

6.1 Bibliography ..................................................................................................69

List of Figures

Figure 1.1:Windows thumbnail images ................................................................5

Figure 2.1: Digital forensics .................................................................................11

Figure 3.1: The general framework of file identification ..................................23

Figure 3.2: The proposed framework of file identification ...............................25

Figure 3.3: CIFAR-10 [31] ...................................................................................29

Figure 3.4: The Mean Image ................................................................................31

Figure 3.5: Eigen Images ......................................................................................34

Figure 3.6: Showing the test image and matched image for two different types

of ‘Truck’ ........................................................................................................37

Figure 3.7: Showing the test image and matched image for two different types

of ‘Ship’ ...........................................................................................................38

Figure 3.8: Showing the test image and matched image for two different images

of ‘Automobile’ .............................................................................................. 38

Figure 3.9: Showing the test image and matched image for two different images

of ‘Cat’ ............................................................................................................39

Figure 3.10: Showing the test image and matched image for two different

images of ‘Horse’ ............................................................................................39

Figure 3.11: Showing the test image and matched image for two different

images of ‘Dog’............................................................................................... 40
Figure 3.12: Showing the test image and matched image for two different

images of ‘Bird’ ............................................................................................ 40

Figure 3.13: Showing wrong matching between the test image and matched

image ...............................................................................................................41

Figure 3.14: Showing wrong matching between the test image and matched

image ...............................................................................................................41

Figure 3.15: The output of matching with different distance metrics ..............46

Figure 3.16: The output of function evaluations ................................................47

Figure 3.17: The output of applying the KNN....................................................48

Figure 3.18: KNN Results .....................................................................................49

Figure 3.19: Deep Learning Workflow of Classifying Images by Using CNN

[30] ...................................................................................................................50

Figure 3.20: CNN feature learning Layers ........................................................ 52


Figure 3.21: How the Convolution Neural Network (CNN) Works [30] .........53

Figure 3.22: The code of including all categories of CIFAR-10 dataset when

applying CNN ..................................................................................................55

Figure 3.23: The output of applying the CNN .....................................................56

Figure 3.24: The output of applying the CNN ....................................................57

Figure 3.25: The output of applying the CNN ....................................................58

Figure 3.26: Showing matching images between test image and matched image

..........................................................................................................................59

Figure 4.1: The chart of the PCA & KNN results ..............................................62

List of Tables

Table 3.1: Classification results using Principal Component Analysis (PCA) .36

Table 3.2: Classification results using K- Nearest Neighbors (KNN) ...............49

Table 3.3: Classification results using Convolution Neural Network (CNN) ...59

Table 4.1: The results of the three different classification techniques. .............65

Table 5.1: A comparison of our results with Norko’s results [33] while using

some classification techniques. ......................................................................67

Acknowledgements

First and foremost, I would like to thank Allah, the one who is most deserving of thanks and praise, for giving me the strength, ability, opportunity, knowledge, and wisdom to finish this research and the perseverance to complete it successfully. I thank Allah again for supporting me during my educational journey to reach and achieve my goal.

I want to express my deepest regards and sincere gratitude to my thesis advisor, Dr. William Allen, who advised, supported, and inspired me to complete my research work. Thank you for your time, suggestions, and guidance in helping me finish this thesis and make it a success. Thank you for the valuable instructions that served as the primary contributor to the completion of my master's thesis. I cannot express enough thanks for the motivation, inspiration, and encouragement to achieve my goal. I am grateful to have been supervised by a professor like you, who has the knowledge and skills to provide me with expert advice and showed me the right path for completing this research.

I also would like to thank my committee members, Dr. Veton Kepuska and Dr. Marius Silaghi, for the learning opportunities in many different fields and the encouragement to finish my thesis work.

I want to take this opportunity to thank my dear parents, Mr. Jafar Alharbi and Mrs. Ratebh Mugharbel, for raising me, standing next to me, motivating me, and supporting me throughout my entire life, especially my educational journey to pursue my master's degree. My parents, thank you for inspiring me to follow my dreams, to try my best to reach and achieve my goals, and for encouraging me in all of my pursuits. My parents had a dream of seeing me finish my master's successfully, and their dream is coming true; thank you for believing in me.

Special thanks to my sponsor, the Ministry of Higher Education in Saudi Arabia, which granted me a scholarship to pursue my master's degree.

Dedication
This thesis is dedicated to my parents, who taught me how to be strong and independent. My parents taught me how to face and deal with life's challenges, to never give up, and to always be optimistic.

Chapter 1

Introduction
1.1 Overview
Forensic analysis is an approach used by governments and other organizations to detect malicious or illegal activity. Government agencies, courts of law, and law enforcement use forensics for criminal cases, and many other organizations use forensics to detect violations of their policies and rules. Forensic science can therefore be defined as the science that deals with gathering data and applying techniques in order to solve legal problems. There are several categories of forensic science based on the area being dealt with, such as forensic chemistry, forensic medicine, forensic linguistics, and forensic mathematics. The National Institute of Justice defines digital forensics as gathering and analyzing digital evidence for use against criminals in court; that evidence can be presented to prosecute criminals for all kinds of crimes, not only computer crimes. Thus, information that is stored or transmitted in binary form for use in court is called digital evidence. Any information stored in digital form is considered digital evidence, such as images, audio, documents, log files, web cookies, and emails.

Most crimes that are based on computers, directly or indirectly, can be identified by using digital forensics tools. These tools play a significant role in law enforcement investigations involving digital media. Nowadays, digital forensics has become the most common approach used in crime investigations for most organizations due to the popularity of digital media. Moreover, some forensics tools, such as Guidance Software's EnCase [11] and AccessData's FTK [12], have combined some of the most efficient functions over the past 15 years [5]. However, there are some challenges that researchers should be aware of in order to come up with beneficial solutions.

At the federal level, these challenges fall into three main categories [5, 17]:

• Technical challenges

These challenges create difficulties that may reduce or prevent law enforcement's ability to catch criminals or bring suit against them; for instance, encryption, media formats, and anti-forensics.

• Legal challenges

These challenges arise from the laws and legal tools that are necessary for cybercrime investigations, which change along with society, technology, and structures; for instance, "jurisdictional issues and a lack of standardized international legislation" [5].

• Resource challenges

These challenges come up when investigators try to guarantee that they have all the data they need for the investigation at every level of government; for instance, forensic media analysis time and data volume.

1.2 Motivation
Many researchers have proposed and tested various techniques in order to come up with effective results that improve accuracy, given the importance of digital forensics in identifying image content. Image forensics is one of the most popular approaches used in digital investigation, applying analysis and classification techniques. The main advantage of this approach is gathering evidence that is appropriate for investigators to present in a court of law.

In cybercrime cases, investigators can discover new evidence beyond what is already available on the system when they use digital forensics techniques. These techniques help investigators discover and recover information that criminals may have partially deleted or manipulated to hide their suspicious and malicious activities.

Image forensics is one of the most beneficial ways used in digital forensics to extract useful data. The extracted data is then analyzed by investigators instead of all the data stored on the system. Image forensics therefore reduces the time spent analyzing data, because its techniques only analyze and compare useful data.

For instance, suppose investigators are trying to find a specific car owned by someone, and they only have an image of the car and many images from a surveillance camera. The investigators will try to find a match for this particular car. Rather than analyzing every image, they can first separate out the car images, so they only need to look through those. Classifying the images in this way may reduce thousands of candidate images to only hundreds of car images, which makes the search for this particular car much faster than traditional methods. Grouping similar images and labeling them simplifies the examiners' work, since they do not have to search through all the other images, which reduces time consumption.

However, identifying an image based on its content is considered a significant challenge in digital forensics due to the high false-positive rate. The false-positive rate is the percentage of images that are incorrectly identified as the target image. There are many reasons for a high false-positive rate, such as missing an essential part of the image. For instance, images that contain many objects, such as people, cars, buildings, and traffic signs, are challenging and may increase the false-positive rate.
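The false-positive rate described above can be computed directly as the fraction of non-target images that a classifier labels as the target class. The following is a minimal sketch; the function name and labels are made-up illustrations, not data from the thesis experiments:

```python
def false_positive_rate(y_true, y_pred, target):
    """Fraction of non-`target` images incorrectly labeled as `target`."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != target and p == target)
    negatives = sum(1 for t in y_true if t != target)
    return fp / negatives if negatives else 0.0

y_true = ["car", "cat", "dog", "car", "ship", "dog"]
y_pred = ["car", "car", "dog", "car", "car", "cat"]
# 2 of the 4 non-car images were labeled "car":
print(false_positive_rate(y_true, y_pred, "car"))  # 0.5
```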

The focus on tiny images, also called thumbnail images, is motivated by the fact that operating systems commonly create smaller versions of the full-sized images stored on a disk drive to display in file browsers or as icons, as shown in Figure 1.1. These thumbnails may still be accessible after the original, larger image has been deleted, encrypted, or stored in an archive that cannot be searched. If forensic examiners could classify these tiny images, they could be useful evidence that the original images existed and may even be hidden on the disk.

Figure 1.1: Windows thumbnail images

Therefore, we focused on this area of digital forensics, identifying image content, to test the accuracy of some well-known classification techniques. We will apply three classification techniques: principal component analysis (PCA), K-nearest neighbors (KNN), and convolutional neural networks (CNN). Furthermore, we will compare our results with other existing studies to determine the percentage of improvement after applying those techniques to a specific dataset.

1.3 Research Goals


This thesis describes work done to investigate different algorithms for the classification of tiny images from the CIFAR-10 dataset. There are two goals of this research:

• First, to determine whether different sizes of training and test data can improve the accuracy of algorithms such as principal component analysis (PCA) and K-nearest neighbors (KNN), which are commonly used for file type identification and image classification.

• Second, to determine whether a more recent classifier, the convolutional neural network (CNN), can provide greater classification accuracy.
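As a toy illustration of the first goal, the sketch below varies the training-set size for a simple 1-nearest-neighbor classifier on synthetic two-dimensional data; a larger training set generally yields higher accuracy. This is an illustrative example only, not the thesis experiment (which used CIFAR-10 images):

```python
import random

def nn_predict(train, point):
    """1-nearest-neighbor: return the label of the closest training point."""
    return min(train, key=lambda t: (t[0] - point[0])**2 + (t[1] - point[1])**2)[2]

random.seed(0)
# Two synthetic classes: class 0 clustered near (0, 0), class 1 near (3, 3).
data = [(random.gauss(0, 1), random.gauss(0, 1), 0) for _ in range(100)] + \
       [(random.gauss(3, 1), random.gauss(3, 1), 1) for _ in range(100)]
random.shuffle(data)
test, pool = data[:50], data[50:]

for n_train in (5, 50, 150):
    train = pool[:n_train]
    acc = sum(nn_predict(train, p) == p[2] for p in test) / len(test)
    print(f"train size {n_train:3d}: accuracy {acc:.2f}")
```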

1.4 Research Contribution


To meet the goals of our research, our primary contributions are as follows:

• To determine the current state of research on applying different classification techniques to file type identification, we provide a literature review of recent studies conducted in this field, as described in the background in Chapter 2.

• To standardize the results for comparison, we used a public source known as the CIFAR-10 dataset [31] to implement the chosen classification techniques, as described in Chapter 3. We used the CIFAR-10 training data to train each model and the testing data to evaluate those techniques and compare the results across all the approaches that were implemented.

• To compare the performance of data classification with the selected approaches, we analyzed the results of the chosen classification techniques, as explained in Chapter 4.

• To compare our results with the existing study by Norko [33], we evaluated several potentially useful techniques to improve the classification of tiny images. The approaches we evaluated come from feature extraction and well-known classification methods. We conducted experiments with those techniques using the CIFAR-10 dataset, then compared our results with Norko's [33], as fully described in Chapter 5.
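For readers unfamiliar with the dataset's layout, the CIFAR-10 binary version stores each image record as one label byte followed by 3,072 pixel bytes (the 32 x 32 red, green, and blue planes in order). A minimal stand-alone reader, exercised here against a synthetic in-memory buffer rather than a real batch file, might look like this:

```python
RECORD = 1 + 32 * 32 * 3  # 1 label byte + 3072 pixel bytes per image

def read_cifar10_records(raw):
    """Yield (label, pixel_bytes) tuples from CIFAR-10 binary-format bytes."""
    for offset in range(0, len(raw), RECORD):
        record = raw[offset:offset + RECORD]
        if len(record) < RECORD:
            break  # ignore a trailing partial record
        yield record[0], record[1:]

# Hypothetical two-record buffer standing in for a real data_batch file.
fake_batch = bytes([3]) + bytes(3072) + bytes([7]) + bytes(range(256)) * 12
records = list(read_cifar10_records(fake_batch))
print([label for label, _ in records])  # [3, 7]
```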

1.5 Research Organization


The body of the thesis is summarized in the following sections:

• Chapter 2: The background, which includes a literature review on digital forensic types, particularly image forensics. This chapter presents an overview of existing studies on digital forensics and some data classification techniques that have been proposed and tested.

• Chapter 3: The methodology we followed and the classification techniques we used to build and test our model. This chapter presents the data structure, the dataset, and the three classification techniques we used to train and test each classifier.

• Chapter 4: Analysis of results. This chapter presents the results and the accuracy of each data classification technique. Furthermore, it provides a comparison of our results across all the classification techniques chosen in this research.

• Chapter 5: Conclusion and future work. This chapter presents a summary of all the sections of the thesis and a comparison of our results with the results of an existing study. Moreover, it presents some of our future work.

• Chapter 6: A bibliography.

Chapter 2

Background

2.1 Overview of Digital Forensic

Digital forensics has become an essential approach to cyber investigation. Many studies have examined existing tools, raising questions about the future of this field. Some studies have been conducted to identify existing digital forensics challenges, such as challenges related to data size (databases and hard drives) and the variety of technology platforms (tablets, mobile, or cloud computing) [5, 6, 7].

Dezfoli et al. [8] aimed to present some of the trends in digital forensics and security applications, along with assessments of future research in this field. There are criminal activities involving computers and mobile phones due to the fast evolution of these devices [8], which criminals use to achieve their nefarious goals. Investigating crimes that involve these devices is hard because most of them are complex. Therefore, there is a considerable need for efficient security measures to help investigators in crime investigation. The approach used to investigate computer crimes in the cyber world is called digital forensics. This area of forensic investigation has caught researchers' attention, and they have done many experiments to find countermeasures for existing challenges [8].

Al Fahdi et al. [5] focused on determining the digital forensics challenges related to data size and the variety of technology platforms, directing other researchers toward the issues that have an impact on the field. The researchers presented a survey of the studies conducted in the field of digital forensics in law enforcement across different organizations. The survey distinguishes between real challenges and anticipated challenges in order to understand the future impacts on this domain.

The results show that most of the participants, around 93%, believed that the number and complexity of investigations might grow in the future. The study was based on 42 participants with different levels of expertise: 45% were academic researchers, 16% were in law enforcement, 31% had a forensics role in an organization, and 55% had three or more years of experience [5]. The results also identified future challenges related to anti-forensics, encryption, and cloud computing. Moreover, the results show that communication between practitioners and researchers should be improved, and that data extraction and identification should evolve through techniques such as criminal profiling [5].

Digital forensics consists of four types, as shown in Figure 2.1: data file forensics, network forensics, database forensics, and web forensics. Each type focuses on a specific area and is used to help investigators in certain cases. We describe each type in the following subsections.

[Diagram: the four digital forensic types — Network Forensic, Data File Forensic, Database Forensic, and Web Forensic — arranged around Digital Forensic]

Figure 2.1: Digital forensic types

2.1.1 Data File Forensic


Data file forensics aims to identify different file types, such as doc, excel, pdf, ppt, txt, jpg, WAV, and mp3. However, some attackers can disguise a file's type in order to hide their malicious activities. The issue of file type identification has therefore caught researchers' attention, owing to the importance of analyzing such data to detect any suspicious activity caused by attackers. Various studies have been done in the area of file type identification to develop new methods and approaches that can improve the existing techniques [1, 2, 3, 4].

Investigators use file type identification technologies to analyze and extract valuable information from different digital devices, such as phones, tablets, CDs, hard disks, and computers. However, damaged hard disks, broken CD-ROMs, and (partly) deleted files are challenges that forensics practitioners may face [2].

Ahmed et al. [1] compared and tested 500 files of each of ten different file types related to images, sounds, and documents, such as JPG, MP3, and PDF. The highest accuracy, about 90%, was achieved when they applied the k-nearest neighbors algorithm while using 40% of the features [1].

Amirani et al. [3] focused on feature-based type identification of file fragments. They tested different classification techniques by taking parts from each file and applying feature extraction. After the feature extraction, classification techniques such as principal component analysis (PCA) and a neural network (NN) were used to detect the file type and measure the accuracy of each classifier.
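Content-based file type identification of the kind discussed above can be sketched in miniature. The toy example below (not the cited authors' exact method; the training content and type names are made up) represents content as a normalized byte-frequency histogram and classifies a sample by its nearest per-type centroid:

```python
def byte_histogram(data):
    """256-bin byte-frequency histogram, normalized to sum to 1."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data) or 1
    return [c / total for c in counts]

def classify(sample, centroids):
    """Return the type whose centroid histogram is closest (L1 distance)."""
    h = byte_histogram(sample)
    return min(centroids,
               key=lambda t: sum(abs(a - b) for a, b in zip(h, centroids[t])))

# Hypothetical training content: text-like vs. uniformly binary-like.
centroids = {
    "text": byte_histogram(b"the quick brown fox jumps over the lazy dog " * 20),
    "binary": byte_histogram(bytes(range(256)) * 4),
}
print(classify(b"hello world, plain ascii text here", centroids))  # text
```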

2.1.2 Network Forensic


Network forensic aims to monitor the traffic of a computer network. Therefore,

there are some studies that have been conducted in order to test and improve some of the

network forensics techniques [9, 13, 14, 26, 27].

Meghanathan et al. [9] introduced some tools and techniques based on the

efficiency, ease of use, and cost to manage network forensics. Furthermore, discovering

evidence for network forensics happens by capturing, analyzing, or recording the


12
network events to identify the origin of these attacks and provide this information to a

court of law [9].

Ghafarian [13] defines network forensics as archiving all network traffic so that

any part of the captured data can be analyzed when needed. For instance, network

forensics is used to uncover evidence, detect intrusions, and respond to incidents. In

addition, network administrators check network traffic in real time using traffic

monitoring tools.

Moreover, network forensics analysis tools (NFATs) are used by network

administrators. These tools can capture traffic segments and analyze the data

transmitted over the network in order to investigate attacks and suspicious activity.

NFATs also support defense in depth, meaning that data from other security tools,

such as firewalls, can be correlated with NFAT data.

Additionally, NFATs serve many purposes, such as predicting future attack

targets, assessing network performance, protecting intellectual property, and analyzing

data theft [13, 14]. Some NFATs help raise the network security level by detecting

criminals and preventing crimes. Ghafarian tested several open-source NFATs to

identify the strengths and weaknesses of each; the experimental results in [13] cover

reporting, analyzing, collecting, filtering, and extracting data.

2.1.3 Database Forensics

Database forensics examines data stored in database management systems

(DBMSs) in order to verify the integrity of the databases. By analyzing a database,

investigators can assess the integrity of its data and determine whether the database

has been compromised.

Nowadays, most governments and organizations use database applications to

store their data, so a high security level is required to protect that data from attack.

Because of their popularity as stores of sensitive data, such databases have become

targets for adversaries. Therefore, forensic techniques are used to track changes in a

database and determine whether its data has been manipulated by an attacker.

Khanuja et al. [16] focused on databases used for banking transactions, since

most financial organizations rely on database applications to store and manage their

data. Fraudulent banking activity has become a critical issue because it threatens the

trust in, and security of, online banking transactions [16, 18, 19]. Detecting these

financial crimes has drawn researchers' attention, in particular Money Laundering (ML)

practices, which are considered among the most dangerous financial crimes. Building

an effective Anti-Money Laundering (AML) system has become a significant challenge

for banks, increasing concern about the security of online banking transactions [16, 17].

The researchers in [16] proposed an approach for assuring the information stored

in banking databases: continuous assurance technology and transaction monitoring

through digital forensic methodology. They focused on this area and addressed it

further in [20, 21, 23]. Moreover, database forensics is an area of digital forensics in

which researchers seek useful information by analyzing extracted data [23, 24, 25].

These data can then be used as evidence in many situations, such as legal, civil, and

criminal proceedings.

2.1.4 Web Forensics


Web forensics is also known as Internet forensics. Nowadays, most people use the

web in their daily lives, which motivates criminals to seek out any information that may

help them achieve their malicious goals. If criminals succeed in finding weaknesses in

web security, they will exploit them to reach those goals. Web forensics helps

investigators detect and collect evidence of any illegal activities.

Meghanathan et al. [9] describe web forensics as gathering the information

essential to a crime in order to identify the criminals and any suspicious activity. Web

forensics deals with all collected data related to the web. For instance, data can be

collected from a user's browsing history: how many times the user visited a specific

website, how long each visit lasted, which files the user downloaded from or uploaded

to the visited website, and other critical information.

Khanikekar [10] provided examples of well-known criminal activities in the cyber

world, such as identity theft and compromised web servers. Web forensics techniques

therefore play a significant role in helping investigators detect evidence of any

suspicious or illegal activity.

2.2 CIFAR-10 Dataset

The CIFAR-10 dataset is a well-known dataset used to train and test models.

Norko [33] examined several classification techniques, running, analyzing, and

improving them on a dataset of many simple images. The dataset used was CIFAR-10,

on which simple image classification algorithms were tested both with and without

principal component analysis (PCA). The CIFAR-10 dataset is described in more detail

in Chapter 3.

Furthermore, the researcher classified 10,000 32x32 color images using the

k-nearest neighbor (KNN) and linear support vector machine (SVM) classification

algorithms and compared their accuracy. The paper [33] reports the results on the

training and testing data with several accuracy measures: confusion matrices and

ROC (receiver operating characteristic) curves.

The researcher in [33] implemented the experiments in MATLAB, in particular

the MATLAB Classification Learner (MCL) interactive application, to create the

classification models. He trained and examined the models with the Classification

Learner application, then exported the created model and applied it to the test data to

produce the evaluation results.

2.3 Overview of Classification Techniques

Computer forensics investigators test and analyze various file types by

developing tools, methods, and techniques to discover suspicious activity related to

file-type identification and carry out the required procedures. File-type identification,

however, remains an obstacle for researchers, who have proposed several approaches

and techniques for identifying various file types, such as JPG, DOC, PDF, and MP3.

The issue of file-type identification has caught researchers' attention due to the

importance of analyzing such data to detect suspicious activity caused by an adversary.

Several studies have therefore been conducted in the area of file-type identification to

develop new methods and to improve existing classification techniques, such as

principal component analysis (PCA), k-nearest neighbors (KNN), and convolutional

neural networks (CNN) [1, 2, 3, 4, 34, 35].

Ahmed et al. [1] compared and tested 500 files of each of ten different file types

related to images, sounds, and documents. The researchers used two techniques to

minimize classification time. The first, feature selection, keeps the features that appear

most frequently in the dataset. The second, content sampling, takes random samples of

file blocks to speed up the classification process.

Moreover, six classification algorithms were used in the experimental tests:

artificial neural networks, the decision tree algorithm, the k-means algorithm, the

k-nearest neighbor algorithm, linear discriminant analysis, and support vector

machines [1]. In total, 500 files of each of ten file types, such as JPG, MP3, and PDF,

were tested and compared, and the highest accuracy, about 90%, was achieved with

the k-nearest neighbor algorithm while using 40% of the features [1].

Amirani et al. [3] proposed a content-based method that applies principal

component analysis (PCA) and unsupervised neural networks (NN) for feature

extraction. After feature extraction, classification techniques are used to detect the file

type and measure the accuracy of each classifier. Their results show the accuracy of

each classification technique as a function of the file-fragment length. PCA is one of

the most popular dimension-reduction techniques, and using a support vector machine

(SVM) classifier improved both accuracy and speed.

Gopal et al. [2] compared the performance of statistical classifiers, such as SVM

and KNN, with knowledge-based approaches, particularly COTS

(Commercial-Off-The-Shelf) solutions. File-Type Identification (FTI) is a fundamental

problem in fields such as digital forensics and intrusion detection, and state-of-the-art

classification techniques help researchers solve FTI problems by enhancing existing

techniques or proposing new methods and tools.

In [2] the researchers introduce three contributions:

• A systematic investigation of the file-type identification (FTI) problem.

• Algorithmic solutions to the FTI problem.

• An evaluation methodology that reports and analyzes the results of

implementing several file-type identification techniques.

The researchers analyzed several methods for handling damaged files and file

segments in order to evaluate the robustness of the tested methods. They proposed two

alternative criteria for measuring performance [2]:

• Filename extensions are treated as the true labels.

• The predictions produced by the signature-based approaches on known intact

files are treated as the true labels.

Additionally, both criteria rely on signature bytes as the true labels, and these

signature bytes are removed before testing each method. The researchers concluded

that SVM and KNN provide the best results, outperforming all of the COTS solutions

tested on files with simulated damage [2]. Their experiments show an improvement in

classification accuracy; however, some COTS methods were unable to identify

damaged files [2].

Vulinović et al. [34] note that one of the most fundamental steps in file forensics

is file-fragment classification, which identifies the file type from its content fragments.

Byte frequency distributions and fragment entropy measures are two kinds of features

used to identify files with machine learning techniques. The researchers contributed to

content-based file identification by implementing well-known techniques, namely

feedforward artificial neural networks and convolutional neural networks (CNN).

The researchers trained the feedforward neural networks on byte histograms and

byte-pair histograms, while they trained the convolutional neural networks (CNN) on

blocks containing 512 bytes of data. The data were acquired from the GovDocs1

dataset. Evaluating the two classification techniques, they found that the feedforward

artificial neural networks provide better results than the convolutional neural

networks (CNN) [34].
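As a rough illustration of the byte-histogram features described above (not the authors' code; only the 512-byte block size follows the description of [34]), a normalized byte-frequency histogram can be computed as:

```python
import numpy as np

def byte_histogram(block: bytes) -> np.ndarray:
    """Normalized 256-bin byte-frequency histogram of a file fragment."""
    counts = np.bincount(np.frombuffer(block, dtype=np.uint8), minlength=256)
    return counts / max(len(block), 1)

# Example: a synthetic 512-byte block, the block size used in [34].
block = bytes(range(256)) * 2        # each byte value appears exactly twice
hist = byte_histogram(block)
print(hist.shape)                    # (256,)
print(hist[0])                       # 2/512 = 0.00390625
```

Byte-pair histograms extend the same idea to counts over consecutive byte pairs, yielding a 65,536-bin histogram.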

Al-Saffar et al. [35] observe that convolutional neural networks (CNNs) are more

powerful than traditional machine learning methods because the increasing number of

hidden layers provides a more complex network structure and more robust feature

learning. Moreover, deep learning algorithms are used to train the CNN model, which

has led to extraordinary achievements on extensive identification tasks. The authors

summarize the development of deep learning and CNNs, covering the basic model

structure, convolutional feature extraction, and the pooling operation, which is

considered one of the most essential operations in CNN classification. The CNN

method is described in detail in Chapter 3.

The researchers then reviewed the development of the convolutional neural

network (CNN) model for deep-learning-based image classification, presenting the

training methods and their performance. They also outlined some existing problems

and predicted new directions of future development [35].

Chen et al. [36] proposed a novel scheme that improves classification accuracy

by extracting many more hidden features. The scheme is based on fragment-to-grayscale

image conversion and deep learning, since file-fragment classification is an essential

step in digital forensics. They built a deep convolutional neural network (CNN) model

that can extract thousands of features by capturing the non-linear connections between

neurons. The researchers used the GovDocs1 dataset to train and test their CNN model,

and their experiments achieved 70.9% correct classification accuracy, higher than

several existing studies [36].

Chapter 3

Methodology

The previous chapter presented a literature review of digital forensics and its four

types, along with the classification techniques used to classify a file based on its

content. This chapter describes the research methodology and presents our work in

detail, reflecting the goals of our research.

3.1 Framework
We start this chapter by clarifying the steps followed in our research. Our

research focuses on identifying useful features using different dimension-reduction and

feature-extraction techniques. These techniques reduce the number of features to a

smaller set while keeping the most useful ones, which helps the classifier identify a file

based on its contents.

Figure 3.1: The general framework of file identification

In addition, we designed a general framework of the steps used to classify data

and identify the content of tiny images, in order to clarify the approach used in this

research. Our general framework contains four stages, which are explained below and

shown in Figure 3.1.

• Stage 1: Download the CIFAR-10 dataset.

• Stage 2: After downloading, the data is preprocessed automatically and split

into two sets: the training set and the test set.

• Stage 3: Before the classification stage, the features can be reduced using

feature-reduction techniques, which shrink the number of features in the

training set while keeping the useful ones.

• Stage 4: The data in the training set is classified, and the classification

process is evaluated by validating the classification technique against the

test set. Finally, the performance rate is identified from the results.

Figure 3.2: The proposed framework of file identification

In this section, the proposed framework is introduced (see Figure 3.2); it is

designed to build a model for classifying tiny images. The proposed research

framework contains four stages:

1. A number of images in the Portable Network Graphics (PNG) format are

chosen from the CIFAR-10 dataset.

2. The CIFAR-10 data is split automatically into a training set and a test set.

3. Dimension-reduction techniques are applied to the training set only,

reducing the data features while keeping the most useful ones.

4. The classification techniques, Principal Component Analysis (PCA),

K-Nearest Neighbors (KNN), and Convolutional Neural Networks (CNNs),

are used to classify the training data, followed by evaluating the trained

model on the chosen test data. Several classifiers are used to assess the

testing data, and the classification results are then obtained.
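The thesis experiments were implemented in MATLAB; as a hedged outline only, the four stages above can be sketched in Python with scikit-learn, using synthetic data in place of CIFAR-10:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Stages 1-2: stand-in for loading CIFAR-10 and splitting into train/test.
rng = np.random.default_rng(0)
X = rng.random((1200, 1024))            # 1,200 flattened 32x32 "images"
y = rng.integers(0, 10, size=1200)      # ten class labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Stage 3: dimension reduction fitted on the training set only.
pca = PCA(n_components=50).fit(X_train)
X_train_r, X_test_r = pca.transform(X_train), pca.transform(X_test)

# Stage 4: classify, then evaluate on the held-out test set.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train_r, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test_r))
print(f"accuracy: {accuracy:.2%}")
```

With random labels the accuracy here is meaningless; the point is the ordering of the stages, in particular that the reducer is fitted on the training set alone.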

3.2 Experimental Structure


In this section, the fundamental structure of the research experiment is

described. It contains the preprocessing experimental study, which consists of the

following three subsections:

1. The software tools used in our research are determined.

2. The dataset description presents the contents of the chosen dataset in detail.

3. The data preparation used with each classification technique is described.

3.2.1 Suitable Software


MATLAB is a high-level programming language used for visualization, analysis,

and development due to its reliability and robustness. It is beneficial software for data

analysis, algorithm development, modeling, and application creation, and a strong tool

for solving mathematical problems, implementing algorithms, and computing the

performance of proposed algorithms.

Besides, MATLAB has a set of toolboxes, which include several functions used

to solve problems and enhance performance. For instance, there are toolboxes for

various areas, such as audio, images, and computer vision. Moreover, the MATLAB

Classification Learner (MCL) interactive application is used to create classification

models.

Therefore, in our experiments, we used MATLAB, version R2018b, to implement

the research experiment and test the chosen classification techniques, PCA, KNN, and

CNN, with tiny images. Some MATLAB toolboxes were needed to obtain good results,

such as the Image Processing Toolbox and the Deep Learning Toolbox.

3.2.2 Dataset
The dataset used in our experiment is CIFAR-10. CIFAR-10 is composed of

60,000 32x32 color images divided into ten classes of 6,000 images each. These images

are divided into two subsets, training and test images: the training set consists of

50,000 images, while the test set consists of 10,000 images [31]. The ten classes are:

Airplane, Automobile, Bird, Cat, Deer, Dog, Frog, Horse, Ship, and Truck. Moreover,

all the images in the CIFAR-10 dataset are in the Portable Network Graphics (PNG)

format.

Krizhevsky et al. [31] divided the CIFAR-10 dataset into five training batches

and one test batch, each of 10,000 images. The test batch contains exactly 1,000

randomly selected images from each class. The remaining images are stored in the

training batches in random order, so some training batches may contain more images

from one class than another; taken together, however, the training batches contain

exactly 5,000 images from each class.

The ten classes of CIFAR-10 are illustrated by ten random images from each

class, as shown in Figure 3.3:

Figure 3.3: CIFAR-10 [31]

“The classes are completely mutually exclusive” [31]. For instance, the

automobile and truck classes do not overlap: automobiles include sedans, SUVs, and

“things of that sort,” while the truck class includes only big trucks, so pickup trucks do

not belong to the truck class.

3.2.3 Data Preparation
Data preparation is the process of cleaning raw data so that it is ready to feed the

model. In our experiment, the CIFAR-10 dataset is used to test three classification

techniques: principal component analysis (PCA), k-nearest neighbors (KNN), and

convolutional neural networks (CNN). The CIFAR-10 data therefore goes through

several processes to produce data suitable for training the model and obtaining a high

classification accuracy rate. The processes applied to the CIFAR-10 data are:

• First, we resize all images used in our experiment to the same size, 32x32. After

resizing, we have three-dimensional color images of size 32x32, which are

considered tiny images. With this step there is no need to worry about the size

of any images added to the experiment, since the resizing process handles them

all.

• Second, the CIFAR-10 dataset is divided into two sets, the training set and the

testing set. To train on the data, we vectorize each image using the reshape

function, so that the image becomes a vector with 1,024 elements. The same

process is applied to all of the training and testing data. For instance, if we

choose 2,000 images for the training set, the size of the whole data will be

2,000x1024. The label corresponding to each sample is then saved; this process

is called supervised learning.

• Third, we normalize the data by subtracting the mean value of the whole

dataset, which stabilizes the data (see Figure 3.4) and moves it to a center with

zero mean. All of these processes are applied to the training and testing data to

prepare them for classification.
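The preparation steps above were carried out in MATLAB; a minimal NumPy analogue, using a synthetic single-channel batch in place of the CIFAR-10 images (matching the 1,024-element vectors described above), might look like this:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for 2,000 resized 32x32 images (one channel, matching the
# 1,024-element vectors reported in the text).
images = rng.random((2000, 32, 32))

# Step 2: vectorize each image (the reshape step).
X = images.reshape(len(images), -1)     # shape (2000, 1024)

# Step 3: normalize by subtracting the mean image, centering the data at zero.
mean_image = X.mean(axis=0)
X_centered = X - mean_image

print(X.shape)                          # (2000, 1024)
```

The mean image computed here corresponds to the image visualized in Figure 3.4.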

Figure 3.4: The Mean Image

3.3 Classification Techniques
This section presents the three classification techniques used in the experiment.

These techniques were chosen because they have been used in previous work:

principal component analysis (PCA) [3, 28, 29, 33], k-nearest neighbors (KNN) [1, 2],

and convolutional neural networks (CNN) [34, 35]. The chosen algorithms are the ones

used for images and other kinds of data analysis in previous work, as shown in detail

in Chapter 2. Moreover, in this section, we provide the steps followed in the experiment

when applying each of these classification techniques, together with their results.

3.3.1 Principal Component Analysis (PCA)

3.3.1.1 Overview of Principal Component Analysis (PCA)


Principal component analysis (PCA) is one of the most important techniques

used in statistics and data science. Karamizadeh et al. [28] define PCA as an

orthogonal transformation that converts a collection of observations of possibly

correlated variables into a collection of values of linearly uncorrelated variables.

PCA is a tool for dimension reduction, meaning that it reduces multidimensional

data to fewer dimensions (e.g., reducing a three-dimensional image to a

two-dimensional one). PCA keeps the most useful part of the data instead of the whole

data, so that comparisons can be made and results obtained in less time. Applying PCA

requires calculating the Euclidean distance, the covariance matrix, and the

eigenvectors.

Therefore, PCA is one of the most beneficial statistical techniques for

applications such as image compression and face recognition, due to its ability to find

patterns in high-dimensional data [28, 29].
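A minimal NumPy sketch of the quantities just mentioned, the covariance matrix and its eigenvectors, applied to reduce synthetic data to its top two principal components:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((100, 16))               # 100 samples, 16 features

# Center the data, then form the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)          # shape (16, 16)

# Eigenvectors of the covariance matrix are the principal components.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
top2 = eigvecs[:, order[:2]]

# Project onto the top two components: 16 dimensions reduced to 2.
X_reduced = Xc @ top2
print(X_reduced.shape)                  # (100, 2)
```

The components associated with the largest eigenvalues capture the most variance, which is why discarding the rest loses as little information as possible.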

3.3.1.2 The Experiment with Principal Component Analysis (PCA)
The strategy of the eigenimages method is followed. It starts by extracting the

most useful features of each image and then representing the target image as a linear

combination of eigenimages. These eigenimages are obtained by applying the

feature-extraction technique (see Figure 3.5). Then, the principal component analysis

(PCA) of the images in the training dataset is applied. Furthermore, images are

compared based on the Euclidean distance between the eigenvector projections of the

eigenimages. According to the Euclidean distance formula, the distance between two

points in the plane with coordinates (x, y) and (a, b) is given by:

Dis(d) = √((x − a)² + (y − b)²)

The eigenvectors are found from the covariance matrix. Therefore, if this

distance is small enough, the image is matched correctly; if the distance is too large,

the image is not matched. The researchers in [28] applied the same strategy that we

used, but for matching faces.
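The matching procedure above can be sketched as follows. This is an illustrative NumPy sketch, not the MATLAB code used in the experiment, and the helper names are hypothetical: project both training and test images onto the eigenvectors and match each test image to the training image at the smallest Euclidean distance.

```python
import numpy as np

def fit_eigenimages(train: np.ndarray, k: int):
    """Return the mean image and top-k eigenvectors of the training data."""
    mean = train.mean(axis=0)
    centered = train - mean
    # Eigenvectors of the covariance matrix, obtained via SVD of the
    # centered data matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k].T               # columns are the eigenimages

def match(test_img: np.ndarray, train: np.ndarray, mean, eig) -> int:
    """Index of the training image nearest in eigenimage coordinates."""
    train_proj = (train - mean) @ eig
    test_proj = (test_img - mean) @ eig
    dists = np.linalg.norm(train_proj - test_proj, axis=1)  # Euclidean distance
    return int(np.argmin(dists))

rng = np.random.default_rng(3)
train = rng.random((50, 1024))          # 50 vectorized training images
mean, eig = fit_eigenimages(train, k=9) # first nine eigenimages, as in Fig. 3.5
idx = match(train[7], train, mean, eig)
print(idx)                              # a training image matches itself: 7
```

In practice a distance threshold decides whether the best match counts as correct or as no match, as described in the text.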

3.3.1.3 The Results

After applying principal component analysis (PCA), we obtained the following

results: the first nine eigenimages (see Figure 3.5), the matching results between the

testing data and the training data, and the accuracy, as described below.

Figure 3.5: Eigen Images

In our experiment, we applied Principal Component Analysis (PCA) in four

attempts with different numbers of training and testing images, as shown in Table 3.1.

Different numbers of training and testing images were chosen as multiples of 10, the

number of classes. For instance, if the number of training images is 2,000, then 200

images are chosen from each class; similarly, if the number of testing images is 400,

then 40 images are chosen from each class. The numbers of training and testing images

were increased in each subsequent attempt. The four attempts are described below:

• In the first attempt, the training set contained 1,000 images and the test set 200.

The correct classification accuracy rate with PCA was 24.44%, with 44 images

matched correctly.

• In the second attempt, the numbers of training and testing images were

increased to 2,000 and 400, respectively. The correct classification accuracy

rate with PCA was 20%, with 72 images matched correctly.

• In the third attempt, the numbers were increased to 3,000 training and 600

testing images. The correct classification accuracy rate with PCA was 17.78%,

with 96 images matched correctly.

• In the fourth attempt, the numbers were increased to 4,000 training and 800

testing images. The correct classification accuracy rate with PCA was 19.31%,

with 139 images matched correctly.


Table 3.1: Classification results using Principal Component Analysis (PCA)

Principal Component Analysis (PCA)

Number of       Number of   Images Matched   Correct Classification
Training Data   Test Data   Correctly        Accuracy Rate (%)
1,000           200         44               24.44%
2,000           400         72               20%
3,000           600         96               17.78%
4,000           800         139              19.31%

The results of the experiment show the correct matching between a test image

from the testing data and the matched image from the training data (see Figures 3.6

through 3.12). Moreover, Table 3.1 lists the numbers of training and testing images

chosen in each attempt, along with the number of images matched correctly in each

attempt. Since the dataset we used is divided into two sets, the testing data and the

training data, the results show the correct matches between these two sets. For

instance, Figure 3.6 shows the test image and matched image for two different types of

‘Truck’ that are matched correctly.

Furthermore, Figure 3.7 shows the test image and matched image for two

different types of ‘Ship’ that are matched correctly. Figure 3.8 shows the same for two

different images of ‘Automobile’, and Figure 3.9 for two different images of ‘Cat’.

Moreover, Figure 3.10 shows the test image and matched image for two different

images of ‘Horse’, Figure 3.11 for two different images of ‘Dog’, and Figure 3.12 for

two different images of ‘Bird’. On the other hand, Figures 3.13 and 3.14 show the

results of wrong matches between the test image and the matched image.

Figure 3.6: Showing the test image and matched image for two different types of
‘Truck’

Figure 3.7: Showing the test image and matched image for two different types of
‘Ship’

Figure 3.8: Showing the test image and matched image for two different images of
‘Automobile’

Figure 3.9: Showing the test image and matched image for two different images of
‘Cat’

Figure 3.10: Showing the test image and matched image for two different images
of ‘Horse’
Figure 3.11: Showing the test image and matched image for two different images
of ‘Dog’

Figure 3.12: Showing the test image and matched image for two different images
of ‘Bird’

Figure 3.13: Showing wrong matching between the test image and matched image

Figure 3.14: Showing wrong matching between the test image and matched image

3.3.2 K-Nearest Neighbors (KNN)

3.3.2.1 Overview of K-Nearest Neighbors (KNN)


The k-nearest neighbors (KNN) algorithm is one of the simplest supervised

machine learning algorithms and is mostly used for classification. KNN is one of the

oldest techniques used in pattern classification [37]. It depends on a distance metric to

determine the nearest neighbors [37]. KNN classifies a data point based on how its

neighbors are classified: it stores all available cases and classifies new cases based on

a similarity measure.

Many studies have been conducted to test the KNN classification technique, as

shown in Chapter 2. Ahmed et al. [1] applied six different classification algorithms to

compare and test 500 files of each of ten file types related to images, sounds, and

documents, such as JPG, MP3, and PDF. The highest accuracy, about 90%, was

achieved with the k-nearest neighbor algorithm while using 40% of the features [1].

There are some advantages to using KNN:

• It is robust to noisy training data.

• It can learn complex models easily.

• It is more effective than some other techniques, such as PCA.

3.3.2.2 The Experiment with K-Nearest Neighbors (KNN)

We must choose K, the parameter that specifies how many of the nearest

neighbors take part in the majority voting process. Choosing the right value of K, a

process called parameter tuning, is essential for getting better accuracy.

Besides, in order to identify the nearest neighbors and produce optimal results,

the distance metric that provides the best result has to be determined, such as

Euclidean distance, Hamming, cosine, or correlation.

Recap of the KNN:

• A positive integer K is specified, along with a new sample.

• We select the K entries in our database which are closest to the new sample.

• We find the most common classification of these entries.

• This classification will be given to the new sample.
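The four steps above can be sketched as a short illustrative Python function. The Euclidean distance and the toy data are our own choices for the example, not the setup of the experiment:

```python
import math
from collections import Counter

def knn_classify(train, labels, sample, k):
    """Label `sample` by majority vote among its k nearest training points."""
    # Steps 1-2: rank every training entry by its distance to the new sample.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(range(len(train)), key=lambda i: dist(train[i], sample))[:k]
    # Steps 3-4: the most common class among those k entries is assigned.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy example: two well-separated classes in the plane.
train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
print(knn_classify(train, labels, (0.5, 0.5), k=3))  # cat
```

For CIFAR-10, each 32x32x3 image would first be flattened into a 3,072-element vector before being used as a training point.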

However, there are some disadvantages that may affect the results, which are:

• The main drawback of KNN is the complexity of searching for the nearest neighbors of each sample, because large datasets contain many elements [32].

• There is a need to determine the value of the parameter K, the number of nearest neighbors, which is hard to do for data with high dimensionality.

• The computation cost is high, because the distance from each query instance to all training samples must be computed.

3.3.2.3 The Results

In our experiment, we applied K-Nearest Neighbors (KNN) with the same four combinations of training and testing data that were used with PCA. The four attempts, in which the numbers of training and testing data were increased each time, are shown in Table 3.2 and described below:

• In the first attempt, the number of training data was 1,000, and the test data was 200. As a result, the correct classification accuracy rate with the KNN was 39%.

• In the second attempt, the numbers of training and testing data were increased: the training data was 2,000, and the test data was 400. As a result, the correct classification accuracy rate with the KNN was 40%.

• In the third attempt, the numbers were increased again: the training data was 3,000, and the test data was 600. As a result, the correct classification accuracy rate with the KNN was 38.16%.

• In the fourth attempt, the training data was increased to 4,000 and the test data to 800. Unfortunately, the program stopped because more resources, such as additional memory, were needed to test the chosen number of images.

Figures 3.15, 3.16, 3.17, and 3.18 show the results of the experiment with KNN classification. Figure 3.15 shows the output of running the KNN program, which tests the data using different distance metrics, such as correlation, Spearman, and Hamming, in order to find the best match, while Figure 3.16 shows the minimum objective versus the number of function evaluations. Moreover, Figure 3.17 shows the distance metric calculations over 30 iterations of testing, indicating whether each evaluation result is accepted, best, or an error. With this information and the numbers of neighbors, the program determines which combination is best. Figure 3.18 shows that the optimization is completed. It presents the best observed feasible point, which was 12 neighbors with the correlation distance metric, while the best estimated feasible point was 16 neighbors with the correlation distance metric.

Figure 3.15: The output of matching with different distance metrics

Figure 3.16: The output of function evaluations

Figure 3.17: The output of applying the KNN

Figure 3.18: KNN Results

Table 3.2: Classification results using K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN)

Number of Training Data    Number of Test Data    Correct Classification Accuracy Rate (%)
1,000                      200                    39%
2,000                      400                    40%
3,000                      600                    38.16%
4,000                      800                    ----

3.3.3 Convolution Neural Network (CNN)

3.3.3.1 Overview of Convolution Neural Network (CNN)


In [30], a convolutional neural network (CNN or ConvNet) is described as a well-known algorithm used for deep learning. CNN is a machine learning technique used to train a model so that classification tasks are executed directly on different data types (e.g., video, text, images, or sound). CNN can be used in many applications based on computer vision and object recognition, such as face recognition applications and self-driving vehicles.

Figure 3.19: Deep Learning Workflow of Classifying Images by Using CNN [30]

CNN is considered one of the most useful classification techniques for finding patterns in images in order to identify and recognize their content, such as faces, objects, and scenes. CNN eliminates the need for manual feature extraction, because it recognizes an image by identifying its patterns and then classifying it.

CNN provides an optimal architecture for object detection and image recognition by finding image patterns (see Figure 3.19). It is one of the most popular methods used in the development of automated driving and facial recognition, owing to the evolution of GPUs and parallel computing. For instance, self-driving cars have the ability to detect and identify objects, such as determining the difference between a pedestrian and a street sign.

The three essential factors that make the Convolutional Neural Network (CNN) one of the most beneficial classification techniques for deep learning are described in [30]:

• There is no need to do feature extraction manually; the CNN identifies the features directly.

• CNNs produce state-of-the-art recognition results.

• CNNs can be retrained for new recognition tasks, which means that users can build on pre-existing networks.

3.3.3.2 The Experiment with Convolution Neural Network (CNN)

A CNN contains several hidden layers surrounded by an input layer and an output layer, as shown in Figure 3.20. There can be tens or hundreds of these layers, each detecting various features of a specific image. Increasing the number of hidden layers therefore yields better results, because each layer learns more specific features from the data.

Figure 3.20: CNN feature learning Layers

Each training image is filtered at different resolutions by a set of filters. Filtering starts with simple features, such as edges and brightness, and increases in complexity until it reaches the unique features that define the object. Moreover, the output of each convolved image is used as the input to the next layer. In this way, CNN implements feature identification and classification of video, sound, images, and text.

Figure 3.21: How the Convolution Neural Network (CNN) Works [30]

These layers perform operations that transform the data in order to learn features specific to the data, as shown in Figure 3.21. Three layers are considered the most popular in a CNN:

• Convolution Layer: The convolution layer passes the input images through a group of convolutional filters, each of which activates particular features of the images.

• Rectified Linear Unit (ReLU) Layer: In this layer, positive values are preserved, and negative values are mapped to zero. This step is sometimes called activation, because only the activated features are carried forward to the next layer. This layer increases processing speed and training effectiveness.

• Pooling Layer: Nonlinear down-sampling is performed to simplify the output, reducing the number of parameters that the network needs to learn.
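As a rough illustration of how these layer types transform a feature map, the toy Python sketch below implements a "valid" 2-D convolution, ReLU, and 2x2 max pooling. This is an illustrative toy with hand-picked values, not the network used in this research:

```python
def conv2d(image, kernel):
    # "Valid" 2-D cross-correlation of an image with a small filter kernel.
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

def relu(fmap):
    # Preserve positive activations; map negative values to zero.
    return [[max(0, v) for v in row] for row in fmap]

def maxpool2(fmap):
    # Non-overlapping 2x2 max pooling: keeps the strongest activation
    # in each block, halving each spatial dimension.
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

# A feature map passes through the three layer types in sequence.
image = [[1, 0, 2, 1], [0, 1, 0, 2], [2, 0, 1, 0], [1, 2, 0, 1]]
kernel = [[1, 0], [0, -1]]  # a toy difference filter
pooled = maxpool2(relu(conv2d(image, kernel)))
```

At scale, the same three operations, with learned kernel weights and many filters per layer, make up the repeated convolutional blocks of the network.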

Repeating these operations over tens or hundreds of layers provides a good level of identification of different features, as shown in Figure 3.21. To assign image classifications, we used the SoftMax classifier. The result is therefore improved, providing high accuracy based on the features that the CNN produced.
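The SoftMax classifier turns the network's final scores into class probabilities, with the largest probability giving the predicted class. A minimal sketch of the softmax function (illustrative only; the scores here are arbitrary, not real network outputs):

```python
import math

def softmax(scores):
    # Subtract the maximum score first for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# The predicted class is the index with the highest probability.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))  # 0
```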

3.3.3.3 The Results

In our experiment, we applied the Convolution Neural Network (CNN) in two attempts with different numbers of training and testing data, as shown in Table 3.3. The numbers of training and testing data were increased in the second attempt. The two attempts are described below:

• In the first attempt, the training data comprised all the data of four categories, which is 20,000 images (5,000 images from each of the four classes), and the testing data was 4,000 images (1,000 images from each of the four classes). As a result, the correct classification accuracy rate while using CNN was 73.15%.

• In the second attempt, the numbers of training and testing data were increased to the whole dataset, meaning all the images of the ten classes in CIFAR-10 for both training and testing (see Figure 3.22). Therefore, the number of training images was 50,000, and the number of testing images was 10,000. As a result, the correct classification accuracy rate with CNN was 74.10%.

Figure 3.22: The code including all categories of the CIFAR-10 dataset when applying CNN

Figures 3.23, 3.24, and 3.25 show part of the results of 7,500 iterations when applying the CNN. The training was performed on a single GPU, an NVIDIA GeForce. These figures show the mini-batch accuracy, mini-batch loss, base learning rate, and the time of each iteration.

Figure 3.26 shows the first nine matches between the testing data and the training data, with the name of the class at the top of each image. Green labels indicate a correct match, while red labels indicate a wrong match. Eight of the nine images are matched correctly. There are three images of 'Deer' and one image for each of the classes Frog, Bird, Cat, and Airplane. There are two images of 'Horse', one matched correctly and the other not.

Figure 3.23: The output of applying the CNN

Figure 3.24: The output of applying the CNN

Figure 3.25: The output of applying the CNN

Figure 3.26: Showing matching images between test image and matched image


Table 3.3: Classification results using Convolution Neural Network (CNN)

Convolution Neural Network (CNN)

Number of Training Data    Number of Test Data    Correct Classification Accuracy Rate (%)
20,000                     4,000                  73.15%
50,000                     10,000                 74.10%

Chapter 4
The previous chapter described the research methodology and presented our work in detail: the research framework, the research structure, the three classification techniques that were used, and their results. This chapter analyzes those results by comparing the three classification techniques, and thus reflects the goals of our research.

4.1 Analyzing Results


We tried different combinations of training and testing data to determine which procedure produces a better result, as shown in Table 4.1.

For principal component analysis (PCA), we found the best accuracy when the number of training images was 1,000 (100 from each of the ten classes) and the number of testing images was 200 (20 from each of the ten classes); the accuracy was 24.44%, with 44 images matched correctly. However, we found more correct matches, 139 images, when the number of training images was 4,000 (400 from each class) and the number of testing images was 800 (80 from each class); the accuracy was 19.31%. Consequently, increasing the number of training and testing images does not necessarily increase the accuracy, but it did increase the number of images matched correctly. In Table 4.1, we can see that the correct classification accuracy rate decreased from 24.44% to 17.78% and then increased slightly to 19.31% as the number of training images went from 1,000 to 3,000 to 4,000 and the number of testing images from 200 to 600 to 800, respectively.

On the other hand, for K-Nearest Neighbors (KNN), we found the best accuracy when the number of training images was 2,000 (200 from each of the ten classes) and the number of testing images was 400 (40 from each of the ten classes); the accuracy was 40%. As Table 4.1 shows, increasing the number of training and testing images may cause a partial increase. However, when we attempted to run the KNN with 4,000 training images and 800 testing images, the program was not able to complete the whole process.

However, we believe this was caused by the fact that the computer used was not a high-performance computer and had a limited amount of memory. The program tried to build all the results in memory, ran out of memory, and crashed. Therefore, there was no mistake in applying the KNN, since the computer worked well with the other two classification techniques, principal component analysis (PCA) and the convolutional neural network (CNN). The limitation of the machine is the reason the program stopped running and we were not able to obtain the results. Thus, if the code were run on a different machine with high performance and a large memory space, the result might be much better.

Additionally, the principal component analysis (PCA) and the K-Nearest Neighbors (KNN) used the same numbers of training and testing images, but the results were different. The KNN produced better results than PCA with all the combinations chosen for the training and testing data (see Figure 4.1). For 1,000, 2,000, and 3,000 training images, the KNN produced a higher correct classification accuracy rate than PCA. Thus, if the computer had been able to complete the run with 4,000 training images and 800 testing images, we can assume that the KNN would also have provided the better result, because the results of the PCA were not better when using 1,000, 2,000, and 3,000 training images.

[Bar chart of PCA and KNN correct classification accuracy rates. PCA: 24.44%, 20%, 17.78%; KNN: 39%, 40%, 38.16% for 1,000, 2,000, and 3,000 training images, respectively.]

Figure 4.1: The chart of the PCA & KNN results

Furthermore, for the convolutional neural network (CNN), the approach followed in choosing the numbers of training and testing images differed from that of the PCA and the KNN. For CNN, we chose a specific number of categories in order to note the differences in results between using all the training data of the chosen categories and using all the training data of all ten categories.

As a result, when the training data comprised 20,000 images over four categories (5,000 images from each class) and the testing data comprised 4,000 images (1,000 images from each class), the accuracy was 73.15%, while when the whole dataset of training and testing data was used, the accuracy was 74.10%. In the latter case, the number of training images was 50,000 and the number of testing images was 10,000 (all the training and testing data of the CIFAR-10 dataset). However, there was not much difference between limiting the categories of training data and using all the categories in the dataset: only a slight improvement in accuracy, from 73.15% with 20,000 images to 74.10% with 50,000 images.

Moreover, the improved results with the convolutional neural network (CNN) were due to using a Graphics Processing Unit (GPU). The high performance of the CNN took advantage of the GPU, which enhanced the performance and produced the highest accuracy results. The CNN algorithm relied on the GPU, and the code used in this experiment was built on a GPU library. There are two main types of GPUs, Nvidia and AMD; the NVIDIA GeForce has one library for the GPU, while AMD has a different one. The code used was written for the NVIDIA GeForce library, which means the code will only work on a machine that has a GeForce GPU; if it is run on a regular processor, it will not run.

As a result, the K-Nearest Neighbors (KNN) provided better results than the principal component analysis (PCA), meaning that KNN was more accurate than PCA, while the convolutional neural network (CNN) did even better at classifying the tiny images of the CIFAR-10 dataset. The highest accuracy with the PCA was 24.44%, when the training data was 1,000 images and the test data was 200 images, while the highest accuracy with the KNN was 40%, when the training data was 2,000 images and the test data was 400 images. Therefore, among all three classification techniques, the convolutional neural network (CNN) significantly improved the accuracy, producing the highest correct classification accuracy, 74.10%, when all the images in each category of the CIFAR-10 dataset were used.

Furthermore, the largest dataset tried gave us the best matching, and the results might be improved further by using a more powerful computer within a reasonable time. Therefore, running larger datasets on more powerful machines with more GPUs may be done in the future, to see how that improves the performance and to make the comparison.

Table 4.1: The results of the three different classification techniques.

Principal Component Analysis (PCA)

Number of Training Data    Number of Test Data    Number of Images Matched Correctly    Correct Classification Accuracy Rate (%)
1,000                      200                    44                                    24.44%
2,000                      400                    72                                    20%
3,000                      600                    96                                    17.78%
4,000                      800                    139                                   19.31%

K-Nearest Neighbors (KNN)

Number of Training Data    Number of Test Data    Correct Classification Accuracy Rate (%)
1,000                      200                    39%
2,000                      400                    40%
3,000                      600                    38.16%
4,000                      800                    ----

Convolution Neural Network (CNN)

Number of Training Data    Number of Test Data    Correct Classification Accuracy Rate (%)
20,000                     4,000                  73.15%
50,000                     10,000                 74.10%

Chapter 5

5.1 Conclusion and Future Work


As shown in Chapter 4, three classification techniques were tested in order to identify the most accurate algorithm for classifying the tiny images of the CIFAR-10 dataset. The best results were achieved when we used the convolutional neural network (CNN). CNN is therefore the better classification algorithm to use, since it produced the best matching results, approximately 74.10%, compared with the other two classification techniques used in this research, PCA and KNN.

As discussed in Section 2.2, earlier work [33] also examined the CIFAR-10 dataset with several classification techniques, including principal component analysis (PCA), K-Nearest Neighbors (KNN), and linear Support Vector Machines (SVM). The researchers in [33] used the KNN algorithm and obtained 29.5%, while the KNN in our experiment provided 40% accuracy; our KNN thus produced a much better correct classification accuracy rate than the one presented in [33]. Furthermore, CNN produced the best results in our experiment, 74.10%, whereas the SVM used in [33] produced only 39.6%. Thus, CNN appears to be a better algorithm to use than the SVM, owing to the higher matching results and accuracy rate that the CNN produced.

Table 5.1: A comparison of our results with Norko's results [33] while using some classification techniques.

Our Results with Using CIFAR-10 Dataset

Classification Techniques               Number of Training Data    Correct Classification Accuracy Rate (%)
Principal Component Analysis (PCA)      1,000                      24.44%
K-Nearest Neighbors (KNN)               2,000                      40%
Convolutional Neural Network (CNN)      50,000                     74.10%

Norko [33] Results with Using CIFAR-10 Dataset

Classification Techniques               Number of Training Data    Correct Classification Accuracy Rate (%)
K-Nearest Neighbors (KNN)               10,000                     29.5%
Linear Support Vector Machines (SVM)    10,000                     39.6%

The largest dataset tried gave us the best matching. In the future, the results might be improved by using a more powerful computer within a reasonable time. Finding more resources that provide high performance with a larger amount of data is therefore one of the most important steps for future work. Running larger datasets on machines with more GPUs would show how much that improves the performance, allowing the comparison to be made and a higher correct classification accuracy rate to be achieved.

Furthermore, we will try different feature reduction and feature extraction techniques in order to keep the most useful features of the actual data. Each method produces different features, which may improve the matching results, because these methods reduce the number of features in the training set to a smaller number while keeping the useful ones.

The accuracy rate of the classification techniques may thereby be increased, providing more accurate results. Moreover, we will apply the same classification techniques examined in this research to other datasets in order to see how they perform, and we will try other classification techniques with different datasets in order to identify the performance and the correct classification accuracy rate of each.

Chapter 6

6.1 Bibliography
[1] Ahmed, I., Lhee, K. S., Shin, H. J., & Hong, M. P. (2011, January). Fast content-
based file type identification. In IFIP International Conference on Digital
Forensics (pp. 65-75). Springer, Berlin, Heidelberg.

[2] Gopal, S., Yang, Y., Salomatin, K., & Carbonell, J. (2011, December). Statistical
learning for file-type identification. In 2011 10th international conference on
machine learning and applications and workshops (Vol. 1, pp. 68-73). IEEE.

[3] Amirani, M. C., Toorani, M., & Mihandoost, S. (2013). Feature‐based Type
Identification of File Fragments. Security and Communication Networks, 6(1), 115-
128.

[4] Cao, D., Luo, J., Yin, M., & Yang, H. (2010). Feature selection based file type
identification algorithm. In 2010 IEEE International Conference on Intelligent
Computing and Intelligent Systems (ICIS), Xiamen, China (pp. 58-62). IEEE.

[5] Al Fahdi, M., Clarke, N. L., & Furnell, S. M. (2013, August). Challenges to digital
forensics: A survey of researchers & practitioners attitudes and opinions. In 2013
Information Security for South Africa (pp. 1-8). IEEE.

[6] Quick, D., & Choo, K. K. R. (2014). Impacts of increasing volume of digital
forensic data: A survey and future research challenges. Digital Investigation, 11(4),
273-294.

[7] Damshenas, M., Dehghantanha, A., & Mahmoud, R. (2014). A survey on digital
forensics trends. International Journal of Cyber-Security and Digital
Forensics, 3(4), 209-235.

[8] Dezfoli, F. N., Dehghantanha, A., Mahmoud, R., Sani, N. F. B. M., & Daryabar,
F. (2013). Digital forensic trends and future. International Journal of Cyber-Security
and Digital Forensics, 2(2), 48-77.

[9] Meghanathan, N., Allam, S. R., & Moore, L. A. (2010). Tools and techniques for
network forensics. arXiv preprint arXiv:1004.0570.

[10] Khanikekar, Sandeep Kumar. "Web Forensics." PhD diss., Texas A&M
University-Corpus Christi, 2010.

[11] Guidance Software, "EnCase Forensic v7", Guidance Software. Retrieved from
http://www.guidancesoftware.com/encase-forensic.htm [Accessed 21/5/13].

[12] AccessData, "FTK - Forensic Toolkit", AccessData. Retrieved from
http://www.accessdata.com/products/digital-forensics/ftk [Accessed 21/5/13].

[13] Ghafarian, A. (2014, March). An empirical study of network forensics analysis


tools. In ICCWS2014-9th International Conference on Cyber Warfare & Security (p.
366).

[14] Pilli, E. S., Joshi, R. C., & Niyogi, R. (2010). Network forensic frameworks:
Survey and research challenges. digital investigation, 7(1-2), 14-27.

[15] Kieseberg, P., Schrittwieser, S., Mulazzani, M., Huber, M., & Weippl, E. (2011,
September). Trees cannot lie: Using data structures for forensics purposes. In 2011
European Intelligence and Security Informatics Conference (pp. 282-285). IEEE.

[16] Khanuja, H. K., & Adane, D. S. (2014, September). Forensic analysis for
monitoring database transactions. In International Symposium on Security in
Computing and Communication (pp. 201-210). Springer, Berlin, Heidelberg.

[17] Jun, T. (2010). On Developing Intelligent Surveillant System of Suspicious


Financial Transaction. In 2010 2nd International Conference on E-business and
Information System Security.

[18] Bertino, E., & Sandhu, R. (2005). Database security-concepts, approaches, and
challenges. IEEE Transactions on Dependable and secure computing, (1), 2-19.

[19] Berger, U. (2011). CSI/FBI Computer Crime and Security Survey 2011-
2012. CSI Computer Security Institute.

[20] Raghavan, S. (2013). Digital forensic research: current state of the art. CSI
Transactions on ICT, 1(1), 91-114.

[21] Khanuja, H. K., & Adane, D. S. (2011). Database security threats and challenges
in database forensic: A survey. In Proceedings of 2011 International Conference on
Advancements in Information Technology (AIT 2011). Retrieved from
http://www.ipcsit.com/vol20/33-ICAIT2011-A4072.pdf.

[22] Khanuja, H. K., & Adane, D. S. (2012). A framework for database forensic
analysis. Computer Science & Engineering: An International Journal (CSEIJ), 2(3),
27-41.

[23] Pavlou, K. E., & Snodgrass, R. T. (2008). Forensic analysis of database


tampering. ACM Transactions on Database Systems (TODS), 33(4), 30.

[24] Khanuja, H. K., & Adane, D. S. (2014, September). Forensic analysis for
monitoring database transactions. In International Symposium on Security in
Computing and Communication (pp. 201-210). Springer, Berlin, Heidelberg.

[25] Pavlou, K. E., & Snodgrass, R. T. (2012, April). DRAGOON: An information


accountability system for high-performance databases. In 2012 IEEE 28th
International Conference on Data Engineering (pp. 1329-1332). IEEE.

[26] Hunt, R. (2012, December). New developments in network forensics—Tools and


techniques. In 2012 18th IEEE International Conference on Networks (ICON) (pp. 376-
381). IEEE.

[27] Hunt, R., & Zeadally, S. (2012). Network forensics: an analysis of techniques, tools,
and trends. Computer, 45(12), 36-43.

[28] Karamizadeh, S., Abdullah, S. M., Manaf, A. A., Zamani, M., & Hooman, A.
(2013). An overview of principal component analysis. Journal of Signal and Information
Processing, 4(03), 173.

[29] Meng, J., & Yang, Y. (2012). Symmetrical two-dimensional PCA with image
measures in face recognition. International Journal of Advanced Robotic Systems, 9(6),
238.

[30] Convolutional Neural Network: 3 things you need to know (n.d.). Retrieved from
https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html#matlab

[31] Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny
images (Vol. 1, No. 4, p. 7). Technical report, University of Toronto.

[32] Theodoridis, S., & Koutroumbas, K. (1998). Pattern recognition.

[33] Norko, A. (2015). Simple image classification using principal component analysis
(PCA). GMU Volgenau School of Engineering, Fairfax, VA, USA, 9.
[34] Vulinović, K., Ivković, L., Petrović, J., Skračić, K., & Pale, P. (2019, January).
Neural Networks for File Fragment Classification. In 42nd International Convention for
Information and Communication Technology, Electronics and Microelectronics (MIPRO).

[35] Al-Saffar, A. A. M., Tao, H., & Talab, M. A. (2017, October). Review of deep
convolution neural network in image classification. In 2017 International Conference
on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET) (pp.
26-31). IEEE.

[36] Chen, Q., Liao, Q., Jiang, Z. L., Fang, J., Yiu, S., Xi, G., ... & Liu, D. (2018, May).
File Fragment Classification Using Grayscale Image Conversion and Deep Learning in
Digital Forensics. In 2018 IEEE Security and Privacy Workshops (SPW) (pp. 140-147).
IEEE.

[37] Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2006). Distance metric learning for
large margin nearest neighbor classification. In Advances in neural information
processing systems (pp. 1473-1480).

