Endsem Project Report B16


Deep learning approach for football event classification using Convolutional Autoencoder and Image classifier networks

Submitted in partial fulfillment of the requirements of the degree of

(Bachelor of Technology)

by

Ankith Suresh-184204

Deshapaga Sindhuja-184215

Tejas Bhat Bellare-184264

Under the Guidance of

Dr. Mohammad Farukh Hashmi

Assistant Professor

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

NATIONAL INSTITUTE OF TECHNOLOGY, WARANGAL

2021-2022
PROJECT WORK APPROVAL FOR B.TECH

The project work entitled “Deep learning approach for football event
classification using Convolutional Autoencoder and Image classifier networks”,
a bona fide record of work carried out by Ankith Suresh (184204), Deshapaga
Sindhuja (184215) and Tejas Bhat Bellare (184264), has been approved for the
degree of Bachelor of Technology, Department of Electronics and
Communication Engineering.

Examiners:

Supervisor:

Dr. Mohammad Farukh Hashmi

Assistant Professor, Department of Electronics and Communication Engineering

Chairman:

Dr. Anjaneyulu L

Head, Department of Electronics and Communications Engineering

Date : 14/12/2021
DECLARATION

I declare that this written submission represents my ideas in my own words
and where others' ideas or words have been included, I have adequately cited
and referenced the original sources. I also declare that I have adhered to all
principles of academic honesty and integrity and have not misrepresented or
fabricated or falsified any idea/data/fact/source in my submission. I
understand that any violation of the above will be cause for disciplinary
action by the Institute and can also evoke penal action from the sources which
have thus not been properly cited or from whom proper permission has not
been taken when needed.

Ankith Suresh

184204

D.Sindhuja

184215

Tejas Bhat Bellare

184264
CERTIFICATE

This is to certify that the dissertation work entitled “Deep learning approach
for football event classification using Convolutional Autoencoder and Image
classifier networks” is a bona fide record of work carried out by Ankith
Suresh (184204), Deshapaga Sindhuja (184215) and Tejas Bhat Bellare (184264),
submitted to the faculty of Electronics and Communication Engineering in
partial fulfillment of the requirements for the award of the degree of Bachelor
of Technology at National Institute of Technology, Warangal
during the academic year 2018-2022.

Dr. Anjaneyulu L
Head of the Department
Department of Electronics and Communication Engineering
NIT Warangal

Dr. Mohammad Farukh Hashmi
Assistant Professor
Department of Electronics and Communication Engineering
NIT Warangal


CONTENTS

1. ABSTRACT
2. INTRODUCTION
3. LITERATURE REVIEW
4. DATASET DESCRIPTION
5. WORKFLOW
6. WORK DONE
7. TRAINING RESULTS
8. NEXT STEPS
9. REFERENCES

1. Abstract

Event detection and classification is particularly important in the field of sports data
analytics. Today, various AI methods are used to detect events in a football game, and
advanced computational techniques in this area can help achieve higher detection
accuracy. In this project we propose a novel architecture for classifying events in a football
(soccer) game. The dataset used consists of numerous images pertaining to different events in a
standard football match. From this dataset the objective is to detect specific and semantically
meaningful events such as pass, kick or shoot. The proposed model consists of a convolutional
autoencoder coupled with an image classifier network that classifies the given images into the
defined events. The convolutional autoencoder is placed before the image classifier in
the model pipeline because it compresses the images and reduces their dimensionality, which
allows a computationally fast implementation.
2. Introduction
Football is widely regarded as one of the most popular sports in the world, and it attracts
a very large number of spectators. Numerous studies are being carried out in this area to
grow and assist the sport and to meet the needs of football franchises, media and stakeholders.
These studies mainly focus on estimating team tactics, player analytics, analysis of the sports
environment (field conditions, weather conditions, stadium records, etc.) and the detection of
many other events that occur in a match. Data-driven decisions play a significant role in soccer
and many other sports. Collecting and properly handling quality data from a soccer match is,
therefore, of immense value.

Machine Learning can assist in conducting the above-mentioned research in order to achieve
more accurate and better optimized results. Deep Learning, an important facet of AI and ML, can be
employed to develop new and advanced techniques to help in this research.

2.1 Applications
The data typically collected from a football game includes: goals scored, assists, tackles, number
of shots on target, possession information, corners, offsides, fouls, cards given, injuries, player
records, substitutions, etc. There are several applications for event-driven systems based on
image data.

Detecting events in football games also helps with computing player and match statistics, football
club development analysis, sports media presentations and many other useful metrics that can be
used inside and outside of the sport.

Counting the number of free kicks, fouls, tackles, etc. in a football game can be done manually.
Using manpower is not only costly and time consuming, but also may be associated with errors.
However, with intelligent systems based on event detection, these statistics can be calculated and
used automatically within minutes.

Another application of event detection is the summarization of a football match. The
summary of a football match consists of the important events that took place during the match. To
prepare a useful summary, these events must be correctly identified. Using an
advanced method for identifying the events can improve the quality of the summarization task.
3. Literature review

3.1 Event Detection and Classification in Football

Before the advent of Deep Learning methods, sports analysis, especially football video
analysis, was classified into two categories: object tracking and pattern recognition [2]. Object
tracking relies on customized cameras [1] and adds computational cost, whereas
the pattern recognition methodology simply extracts lower-level features and then uses a
classifier to detect higher-level events.
An approach proposed by [3] included categorization of events into distinct categories like shoot,
goal, etc. Such an approach includes feature extraction and heuristic rules for detecting events.
They perform low-level analysis to detect marks (field, lines, logo, arcs, and goalmouth), player
positions, ball position, etc, and then derive mid-level features using these cues. In the end, they
developed a rule-based system to detect salient events like the goal, corner, etc.

Deep Learning methods such as Convolutional Neural Networks and Restricted Boltzmann
Machines have been successfully used for event detection. CNNs have shown better performance
in image classification, object detection and modeling high-level visual semantics.

The authors in [4] present a method for detecting soccer events using a Bayesian
network. Such early methods suffered from low accuracy until methods based on DL were
proposed. Using a convolutional network combined with an
LSTM network, [5] present a method for summarizing a soccer match based on event detection,
in which five events (corner kicks, free kicks, goal scenes, centerline, and throw-in) are
considered. This study uses the 3D-ResNet34 architecture in the convolutional network structure.
One of the problems with this work is that no highlights are taken into account.
Another work [13] presents a deep learning approach for identifying major
events in soccer images. Their approach introduces the concept of using an Autoencoder as a
precursor to image classification. The Autoencoder compresses the image and
reduces the computational burden caused by large image dimensionality.
However, the Variational Autoencoder (VAE) used in their model generally has the disadvantage of
producing blurry output due to the injected noise and imperfect reconstruction.

This encouraged us to modify our approach and use a new architecture for the Autoencoder
stage. We built a Convolutional Autoencoder for the image encoding process as the convolution
operator allows filtering an input signal in order to extract some part of its content. Autoencoders
in their traditional formulation do not take into account the fact that an image can be seen as a
combination of other images. Convolutional Autoencoders, instead, use the convolution operator
to exploit this observation. They learn to encode the input in a set of simple images and then try
to reconstruct the input from them. Convolutional Autoencoders can retain spatial and temporal
information.

3.2 Autoencoder

The traditional Autoencoder (AE) framework consists of three layers, one for inputs, one for
latent variables, and one for outputs. The clear definition of this framework first appeared in [6].
This type of network consists of the following parts:

Encoder: The part of the network that compresses the input into a space of latent variables. This
can be represented by the encoding function

h = f(x)

Code Layer: This part of the network contains the reduced representation of the input that is fed
into the decoder.

Decoder: The part that tries to reconstruct the input from the reduced representation held in
the code layer. It is represented by the decoding function

r = g(h)

where h is the latent variable and r is the data reconstructed from the latent space, which belongs
to the same space as the input x. Training is done by minimizing the reconstruction
error between r and x using back-propagation. This framework was initially proposed to achieve
dimensionality reduction.
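
The encode, decode and minimize-the-reconstruction-error loop described above can be sketched in a few lines of NumPy. This is only an illustrative toy, not the model used in this project: the encoder f and decoder g are assumed to be single linear maps without biases, the data is synthetic, and the learning rate and step count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input: 200 samples with 8 features that lie on a 2-D subspace,
# so a 2-unit code layer can in principle reconstruct them exactly.
x = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))

# Linear encoder f and decoder g (assumption: no biases, no activation).
w_enc = rng.normal(scale=0.1, size=(8, 2))
w_dec = rng.normal(scale=0.1, size=(2, 8))


def reconstruction_loss(x, w_enc, w_dec):
    """Mean over samples of the squared error ||g(f(x)) - x||^2."""
    r = (x @ w_enc) @ w_dec
    return float(np.mean(np.sum((r - x) ** 2, axis=1)))


initial_loss = reconstruction_loss(x, w_enc, w_dec)

learning_rate = 0.002
for _ in range(3000):
    h = x @ w_enc                          # h = f(x): latent code
    r = h @ w_dec                          # r = g(h): reconstruction
    grad_r = 2.0 * (r - x) / len(x)        # d(loss)/d(r)
    grad_dec = h.T @ grad_r                # back-propagate through the decoder
    grad_enc = x.T @ (grad_r @ w_dec.T)    # ... and through the encoder
    w_dec -= learning_rate * grad_dec
    w_enc -= learning_rate * grad_enc

final_loss = reconstruction_loss(x, w_enc, w_dec)
h = x @ w_enc                              # 8 features compressed to 2
```

Here the code layer h has 2 values per sample instead of 8, which is the dimensionality reduction the framework was proposed for.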

With the same purpose, [7] proposed a deep autoencoder architecture, where the encoder and the
decoder are multi-layer deep networks.

3.2.1 CNN and Convolutional Autoencoders

Convolutional Neural Networks (CNNs, or ConvNets) are some of the best known neural networks
for modeling image data. They are very efficient at retaining the connected information between the
pixels of an image, and the particular design of their layers makes them a better choice for
processing image data.

When a CNN is used for image noise reduction or coloring, it is applied in an Autoencoder
framework, i.e., the CNN is used in the encoding and decoding parts of an Autoencoder. Each of
the input image samples is a noisy image, and each of the output image samples is the
corresponding image without noise. The trained model can then be applied to a noisy image to
output a clean image. This is especially useful when we want to denoise image data before
further processing.

Instead of stacking the data into a flat vector, Convolutional Autoencoders keep the spatial layout
of the input image data as it is and extract information using convolution layers. This
process is designed to retain the spatial relationships in the data. This is the encoding process in
an Autoencoder. In the middle, there is a fully connected autoencoder with hidden layers. After
that comes the decoding process that reconstructs the image. The encoder and the decoder are
symmetric.

3.3 Image Classification Models

Deep Learning offers several image classification models to recognize/classify
images. Some of these models are trained with gradient-based back-propagation algorithms while
others use techniques such as the Moore-Penrose generalized inverse [12] to set their weights.

A few of these image classification models were reviewed and their applications were studied in
order to integrate the most suitable model into our architecture.

3.3.1 Extreme Learning Machine

Extreme learning machine (ELM) is a training algorithm for single hidden layer feedforward
neural networks (SLFN), which converges much faster than traditional methods and yields
promising performance [8].

Unlike traditional feedforward network learning algorithms such as back-propagation (BP),
ELM does not use a gradient-based technique and does not require iterative optimization
of its parameters. The input weights and hidden-neuron biases are generated randomly and
are never tuned, while the output weights are computed analytically in a single step from the
hidden-layer outputs and the class labels.

Some of the types of ELMs discussed in [9] include:

● Basic Extreme Learning Machine (Basic ELM).
● Extreme Learning Machine-Autoencoder (ELM-AE).
● Multilayer Extreme Learning Machine (MLELM).

Basic Extreme Learning Machine (Basic ELM)

The model of ELM consists of the input layer, a single hidden layer, and the output layer. The model
structure of ELM is shown in Fig 1, with j input layer nodes, n hidden layer nodes, m
output layer nodes, and the hidden layer activation function g(x). For N distinct samples

x_i ∈ R^N × R^j, y_i ∈ R^N × R^m (i = 1, 2, ..., N)

the outputs of the hidden layer can be expressed as:

h = g(ax + b)

Fig 1. The model structure of ELM.
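
As a rough illustration of the basic ELM, the sketch below draws the hidden-layer parameters a and b at random, computes h = g(ax + b) with a sigmoid g, and solves for the output weights with the Moore-Penrose generalized inverse [12]. The function names, the toy two-class data and the hidden-layer size are hypothetical and chosen only for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_elm(x, y_onehot, n_hidden, rng):
    """Fit a basic ELM: random hidden layer, analytic output weights."""
    a = rng.normal(size=(x.shape[1], n_hidden))  # input weights, never tuned
    b = rng.normal(size=n_hidden)                # hidden biases, never tuned
    h = sigmoid(x @ a + b)                       # h = g(ax + b)
    beta = np.linalg.pinv(h) @ y_onehot          # least-squares output weights
    return a, b, beta


def predict_elm(x, a, b, beta):
    """Return the index of the largest output node for each sample."""
    return np.argmax(sigmoid(x @ a + b) @ beta, axis=1)


rng = np.random.default_rng(0)
# Hypothetical toy data: two well-separated Gaussian blobs in 2-D.
x = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(2.0, 0.5, size=(50, 2))])
labels = np.repeat([0, 1], 50)
y_onehot = np.eye(2)[labels]

a, b, beta = train_elm(x, y_onehot, n_hidden=20, rng=rng)
train_accuracy = float(np.mean(predict_elm(x, a, b, beta) == labels))
```

Training involves no iterative loop at all, which is where the speed advantage of ELM over back-propagation comes from.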

Extreme Learning Machine-Autoencoder (ELM-AE)

The model of ELM-AE consists of the input layer, a single hidden layer, and the output layer. The
model structure of ELM-AE is shown in Fig 2, with j input layer nodes, n hidden layer
nodes, j output layer nodes, and the hidden layer activation function.

Fig 2. The model structure of ELM-AE.


3.3.2 Artificial Neural Networks

An ANN [10] is a network of connected input/output units in which a weight is associated with each
connection. It consists of one input layer, one or more intermediate layers and one output layer.
A neural network learns by adjusting the weights and biases of its connections.
Updating the weights iteratively improves the performance of the network.

The behavior of neural networks is affected by learning rules, architecture, and transfer function.
Neurons of neural networks are activated by the weighted sum of input. The activation signal is
passed through a transfer function to produce a single output of the neuron. Non linearity of the
network is produced by this transfer function. During training, the inter-connection weights are
optimized until the network reaches the specified level of accuracy.
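
The behavior of a single neuron described above (weighted sum, transfer function, iterative weight update) can be sketched as follows. The sigmoid transfer function, the squared-error objective, the learning rate and the input values are assumptions made for this example only.

```python
import math


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs passed through a sigmoid transfer function."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-activation))


def train_step(inputs, weights, bias, target, learning_rate=0.5):
    """One gradient-descent update on the error 0.5 * (output - target)^2."""
    out = neuron_output(inputs, weights, bias)
    delta = (out - target) * out * (1.0 - out)  # chain rule through the sigmoid
    new_weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
    return new_weights, bias - learning_rate * delta


inputs, target = [1.0, 2.0], 1.0
weights, bias = [0.5, -0.25], 0.0

output = neuron_output(inputs, weights, bias)   # weighted sum is 0, so sigmoid gives 0.5
initial_error = abs(output - target)

for _ in range(100):
    weights, bias = train_step(inputs, weights, bias, target)

final_error = abs(neuron_output(inputs, weights, bias) - target)
```

Each update moves the weights in the direction that lowers the error, so the neuron's output drifts toward the target as training proceeds.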

The Convolutional Neural Network (CNN) [14] is one of the most popular neural network
designs used for image classification or image noise reduction and coloring. The CNN model is
trained by taking many image samples as the inputs and their labels as the outputs. The trained CNN
model is then applied to a new image to predict its label. The
Convolutional Neural Network involves several intermediate steps such as pooling, which reduces
the dimensionality of the feature maps, and padding, which controls their spatial size.

Fig 3. CNN Architecture [11]
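
The effect of the pooling and padding steps on the size of a feature map can be seen on a small array. This is a minimal NumPy sketch; the helper name `max_pool_2x2` is hypothetical, and the example uses a single 4x4 channel rather than a real image.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on an (H, W) array with even H and W."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))


image = np.arange(16, dtype=float).reshape(4, 4)
padded = np.pad(image, 1)      # zero padding on every side: 4x4 -> 6x6
pooled = max_pool_2x2(image)   # keeps the largest value of each 2x2 block: 4x4 -> 2x2
```

Pooling halves each spatial dimension, while zero padding adds a border so that a subsequent 3x3 convolution can return a feature map of the original size.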


4. Dataset Description

The Soccer Event Dataset [13] used consists of numerous images pertaining to a football (soccer)
game. The images cover various events that generally occur in a standard football match. From
these events we have taken 4 commonly seen occurrences as our image labels. These events
are:
● Free Kick
● Tackle
● Penalty
● Corner Kick

In the present project, two image sets for a football match were collected, one for training
and one for testing:

Summary of Data Description

Class          Training Images   Testing Images
Free Kick      2000              500
Tackle         2000              500
Penalty        2000              500
Corner Kick    2000              500
Total          8000              2000


5. Workflow
6. Work Done

6.1 Project overview


In this initial stage, a brief review of topics related to the project was done. The dataset
required for the project was collected and analyzed on the basis of size, format, the fraction of the
original dataset required for our project, etc.

Then, the literature survey as well as the coding knowledge required to implement the
project was reviewed and refreshed.

6.2 Review of individual project components


In this stage, brief research was done on i) Extreme Learning Machines and ii) Autoencoders to
understand the concepts, the different types and their applications.
The concepts used in object detection and classification were also studied.

6.3 Dataset Preparation and Preprocessing

The dataset collected for the project consists of various events in a football match and is
categorized into 4 classes, namely tackle, free kick, penalty and corner.
The images and their corresponding labels were stored in two separate lists, namely data and
target.
All the images were then resized to 224x224 and normalized. These images were then
saved in the form of NumPy arrays.
● The train set consists of a total of 8000 images, each with dimensions
224x224x3 (the last dimension corresponds to the RGB channels of the image)
● The test set consists of a total of 2000 images, also with dimensions of 224x224x3
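
The preprocessing steps above can be sketched roughly as follows. Several details are assumptions, since the report does not state them: normalization is taken to mean scaling 8-bit pixel values into [0, 1], a simple nearest-neighbour resize stands in for whatever image library was actually used, the class label strings are hypothetical, and random arrays replace the images read from disk.

```python
import numpy as np

CLASSES = ["free_kick", "tackle", "penalty", "corner_kick"]  # assumed label names
TARGET_SIZE = 224


def resize_nearest(image, size):
    """Nearest-neighbour resize of an (H, W, 3) array to (size, size, 3)."""
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    return image[rows][:, cols]


def preprocess(image):
    """Resize to 224x224 and scale 8-bit pixel values into [0, 1]."""
    return resize_nearest(image, TARGET_SIZE).astype(np.float32) / 255.0


rng = np.random.default_rng(0)
# Stand-in for images read from disk: random 8-bit RGB frames of varying size.
raw_images = [rng.integers(0, 256, size=(360, 640, 3), dtype=np.uint8),
              rng.integers(0, 256, size=(480, 480, 3), dtype=np.uint8)]
raw_labels = ["tackle", "penalty"]

data = np.stack([preprocess(img) for img in raw_images])         # (N, 224, 224, 3)
target = np.array([CLASSES.index(name) for name in raw_labels])  # integer class ids
```

The resulting `data` and `target` arrays could then be written to disk with `np.save` so that the resize step does not need to be repeated for every training run.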

6.4 Building and Testing the Autoencoder

The first stage of the code implementation involved creating the Convolutional Autoencoder.
The coding was done in Python using the TensorFlow library as the backend for deep learning
applications. In this stage, the concepts related to autoencoders were thoroughly researched and
understood before beginning the implementation.

The construction of the autoencoder is explained as follows. The convolutional autoencoder
consists of two parts, an encoder and a decoder.

Encoder
● The encoder consists of 2 convolutional layers.
● The input layer consists of a Conv2D layer with 16 kernels of 3x3 dimensions and leaky
ReLU as the activation function, followed by a max pooling layer of 2x2 dimensions.
● This layer is followed by another Conv2D layer with 8 kernels of 3x3 dimensions and leaky
ReLU as the activation function, followed by a max pooling layer of 2x2 dimensions.

Decoder
● The decoder consists of a Conv2D_transpose layer, where the weights are transposed
and flipped by 180°, followed by an upsampling layer.
● This layer is followed by another Conv2D_transpose layer and an upsampling layer.
● The output layer consists of 3 kernels of 3x3 dimensions with sigmoid as the activation
function.

The figure below shows the summary of the Autoencoder architecture

Fig 4. Convolutional Autoencoder model architecture

The model was compiled with ‘Mean square error’ as the loss function, ‘Adam’ as the optimizer
and ‘Accuracy’ as the evaluation metric. A portion of the test data was also used for validation
during training. The model was then trained over 50 epochs with a batch size of 32.
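
A Keras sketch of an autoencoder with the layer layout described in this section is given below. It is an approximation rather than the exact project code: the "same" padding, the filter counts of the two transposed-convolution layers (8 and 16), the default leaky ReLU slope and the function name `build_conv_autoencoder` are all assumptions that the report does not specify.

```python
from tensorflow.keras import layers, models


def build_conv_autoencoder(input_shape=(224, 224, 3)):
    """Return (autoencoder, encoder) following the described layer layout."""
    inputs = layers.Input(shape=input_shape)

    # Encoder: two Conv2D + leaky ReLU + 2x2 max pooling stages.
    x = layers.Conv2D(16, (3, 3), padding="same")(inputs)
    x = layers.LeakyReLU()(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(8, (3, 3), padding="same")(x)
    x = layers.LeakyReLU()(x)
    encoded = layers.MaxPooling2D((2, 2))(x)

    # Decoder: two Conv2DTranspose + upsampling stages (filter counts assumed).
    x = layers.Conv2DTranspose(8, (3, 3), padding="same")(encoded)
    x = layers.LeakyReLU()(x)
    x = layers.UpSampling2D((2, 2))(x)
    x = layers.Conv2DTranspose(16, (3, 3), padding="same")(x)
    x = layers.LeakyReLU()(x)
    x = layers.UpSampling2D((2, 2))(x)

    # Output layer: 3 kernels of 3x3 with sigmoid activation.
    outputs = layers.Conv2D(3, (3, 3), activation="sigmoid", padding="same")(x)

    autoencoder = models.Model(inputs, outputs)
    encoder = models.Model(inputs, encoded)
    autoencoder.compile(optimizer="adam", loss="mse", metrics=["accuracy"])
    return autoencoder, encoder
```

Under these assumptions the encoder maps each 224x224x3 image to a 56x56x8 code. Training would then use a call such as `autoencoder.fit(x_train, x_train, epochs=50, batch_size=32, validation_data=(x_val, x_val))`, where `x_train` and `x_val` are placeholder names for the preprocessed image arrays.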
6.5 Building the Image Classification Network

Proposed Classifier 1

Extreme Learning Machine (ELM)


The model proposed here consists of an Autoencoder followed by an Extreme Learning Machine
classifier. The encoded data is passed through the ELM classifier to get the classified output.
The advantage offered by this model is that, in addition to the data compression provided
by the Autoencoder, the training time is reduced drastically, because the ELM is trained
with a simple closed-form mathematical solution.
ELM theory suggests using randomly assigned input weights for fast learning, and relies on the
network's universal approximation ability together with a least-squares solution for the output
weights. The number of hidden nodes in the network was chosen by trial and error to get the
highest accuracy. The product of the image dimensions was taken as the input length.

The model employs sigmoid as the activation function and was evaluated using mean squared
error.

The figure below shows the specifications used to initialize the ELM model.

Fig 5. ELM network specifications

Proposed Classifier 2

Custom VGG16 model (Transfer Learning):


VGG16 is a pre-trained convolutional neural network that is used in image classification and
detection problems. The model was trained on the ImageNet dataset. The VGG architecture
consists of a stack of convolutional layers, five max pooling layers that perform spatial pooling
(max pooling is performed over a 2×2 pixel window, with stride 2) and three fully connected
dense layers. The final layer is the soft-max layer. The configuration of the fully connected
layers is the same in all VGG networks, and all hidden layers are equipped with the rectification
(ReLU) non-linearity.
For the purpose of our classification problem, some of the existing layers of the VGG model were
customized to provide a training process tailored to our data. The final
layers of the model were replaced with three customized dense layers followed by the output
layer. The VGG model was then redefined with these integrated custom layers to suit the
problem at hand. This enabled us to leverage the advantages of the pre-trained model whilst also
ensuring compatibility with our 4-class problem.

Fig 6. Custom VGG16 Model Architecture
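
One possible way to build such a customized VGG16 in Keras is sketched below. The report does not give the widths of the three custom dense layers, whether the convolutional base was frozen, or the optimizer and loss used, so the values here (256, 128 and 64 units, a frozen base, Adam and categorical cross-entropy) and the function name `build_custom_vgg16` are assumptions for illustration only.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16


def build_custom_vgg16(num_classes=4, weights="imagenet"):
    """VGG16 convolutional base with three custom dense layers and a softmax output."""
    base = VGG16(weights=weights, include_top=False, input_shape=(224, 224, 3))
    base.trainable = False  # assumption: keep the pre-trained filters fixed

    x = layers.Flatten()(base.output)
    for units in (256, 128, 64):  # hypothetical widths of the custom dense layers
        x = layers.Dense(units, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = models.Model(base.input, outputs)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Passing `weights="imagenet"` downloads the pre-trained ImageNet weights on first use, while `weights=None` builds the same architecture with random initialization, which is convenient for checking layer shapes without a network connection.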


7. Training Results

1. Autoencoder

              Loss      Accuracy
Training      0.0050    85.10%
Validation    0.0040    91.98%

2. Extreme Learning Machine with Autoencoder

Training time: 8.322 seconds

              Loss      Accuracy
Training      0.0160    98.15%
Validation    0.0991    60.93%

3. Custom VGG16 model

Training time: 1 hour 4 minutes

              Loss      Accuracy
Training      0.1627    94.50%
Validation    0.5399    86.25%


Plots and simulation results

1. Autoencoder

Fig 7. Plots showing the training and validation loss and accuracy with increasing epochs for
Autoencoder

2. Extreme Learning Machine with Autoencoder

Fig 8. Simulation Result for ELM with Autoencoder model training


3. Custom VGG 16 model

Fig 9. Plots showing the training and validation loss and accuracy with increasing epochs for
custom VGG16 model

Autoencoder Sample Test Case

Original image Reconstructed image


ELM with Autoencoder Classification Results (Confusion Matrix)

CNN Classification Results

Fig 10. Classification results of the VGG 16 model for sample images
8. Next Steps
It can be noted that while the ELM model offers lower validation accuracy, its training time is
extremely short. This is in contrast to the CNN classifier, which offers relatively higher
accuracy but a much longer training time.
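
The size of this tradeoff can be read directly from the figures reported in the Training Results section. The short calculation below only restates those numbers; the dictionary layout and variable names are illustrative.

```python
# Figures copied from the Training Results tables (accuracy in %, time in seconds).
results = {
    "ELM with Autoencoder": {"train_acc": 98.15, "val_acc": 60.93,
                             "train_seconds": 8.322},
    "Custom VGG16": {"train_acc": 94.50, "val_acc": 86.25,
                     "train_seconds": 64 * 60},
}

# Gap between training and validation accuracy, in percentage points.
gaps = {name: round(r["train_acc"] - r["val_acc"], 2)
        for name, r in results.items()}

# How many times longer the VGG16 model took to train than the ELM.
time_ratio = (results["Custom VGG16"]["train_seconds"]
              / results["ELM with Autoencoder"]["train_seconds"])
```

The ELM trains roughly 460 times faster, but its gap of about 37 percentage points between training and validation accuracy, against about 8 for the VGG16 model, suggests that it generalizes far less well in its current form.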

Thus the next steps include optimizing both the above mentioned architectures to minimize the
tradeoff between accuracy and training time. The ways to optimize the model would be to either
test different activation functions, feed more data or tune model hyperparameters. Based on the
results obtained, the better performing architecture will then be finalized and will be put through
one more round of fine tuning to achieve the expected results.

The final model should be able to encode the large image set efficiently and accurately
distinguish between the 4 events described in our report whilst also having lower computational
time.
9. References:

[1] Jia Liu, Xiaofeng Tong, Wenlong Li, Tao Wang, Yimin Zhang, and Hongqi Wang. Automatic
player detection, labeling and tracking in broadcast soccer video. Pattern Recognition Letters,
30(2):103-113, 2009.

[2] Chung-Lin Huang, Huang-Chia Shih, and Chung-Yuan Chao. Semantic analysis of soccer
video using dynamic bayesian network. IEEE Transactions on Multimedia, 8(4):749-760, 2006.

[3] Xueming Qian, Guizhong Liu, Huan Wang, Zhi Li, and Zhe Wang. Soccer video event
detection by fusing middle level visual semantics of an event clip. In Pacific-Rim Conference on
Multimedia, pages 439-451. Springer, 2010.

[4] M. Tavassolipour, M. Karimian, and S. Kasaei, “Event detection and summarization in soccer
videos using bayesian network and copula,” IEEE Transactions on Circuits and Systems for Video
Technology, vol. 24, no. 2, pp. 291–304, 2013.

[5] R. Agyeman, R. Muhammad, and G. S. Choi, “Soccer video summarization using deep
learning,” in 2019 IEEE Conference on Multimedia Information Processing and Retrieval
(MIPR). IEEE, 2019, pp. 270–273.

[6] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning
from examples without local minima. Neural Networks, 2(1):53–58, January 1989.

[7] Geoffrey Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with
neural networks. Science, 313(5786):504–507, 2006.

[8] Wang, J., Lu, S., Wang, SH. et al. A review on extreme learning machine. Multimed Tools
Appl (2021). https://doi.org/10.1007/s11042-021-11007-7

[9] Ding, Shifei & Zhang, Nan & Xu, Xinzheng & Guo, Lili & Zhang, Jian. (2015). Deep Extreme
Learning Machine and Its Application in EEG Classification. Mathematical Problems in
Engineering. 2015. 1-11. 10.1155/2015/129021.

[10] A. K. Jain, J. Mao, and K. M. Mohiuddin, “Artificial neural networks: A tutorial,” Computer,
vol. 29, no. 3, pp. 31–44, 1996.

[11] I. Tabian, H. Fu, and Z. S. Khodaei, “A Convolutional Neural Network for Impact Detection
and Characterization of Complex Composite Structures,” Sensors, vol. 19, no. 22, p. 4933, Nov.
2019.

[12] Ben-Israel, Adi. (2002). The Moore of the Moore-Penrose Inverse. ELA. The Electronic
Journal of Linear Algebra [electronic only]. 9. 10.13001/1081-3810.1083.

[13] Karimi, Ali & Toosi, Ramin & Akhaee, Mohammad. (2021). Soccer Event Detection Using
Deep Learning.

[14] Krizhevsky, Alex & Sutskever, Ilya & Hinton, Geoffrey. (2012). ImageNet Classification with
Deep Convolutional Neural Networks. Neural Information Processing Systems. 25.
10.1145/3065386.
