Endsem Project Report B16
(Bachelor of Technology)
by
Ankith Suresh-184204
Deshapaga Sindhuja-184215
Assistant Professor
2021-2022
PROJECT WORK APPROVAL FOR B.TECH
The Project Work entitled “Deep learning approach for football event
classification using Convolutional Autoencoder and Image classifier networks”,
a bona fide record of work carried out by Ankith Suresh (184204), Deshapaga
Sindhuja (184215) and Tejas Bhat Bellare (184264), has been approved for the
degree of Bachelor of Technology, Department of Electronics and
Communication Engineering.
Examiners:
Supervisor:
Dr. Anjaneyulu L
Date : 14/12/2021
DECLARATION
Ankith Suresh
184204
D.Sindhuja
184215
184264
CERTIFICATE
This is to certify that the dissertation work entitled “Deep learning approach
for football event classification using Convolutional Autoencoder and Image
classifier networks” is a bonafide record of work carried out by “Ankith
Suresh(184204), Deshapaga Sindhuja (184215), Tejas Bhat Bellare (184264) ”,
1. ABSTRACT
2. INTRODUCTION
3. LITERATURE REVIEW
4. DATASET DESCRIPTION
5. WORKFLOW
6. WORK DONE
7. TRAINING RESULTS
8. NEXT STEPS
9. REFERENCES
1. Abstract
Event detection and classification is particularly important in the field of sports data
analytics. Various AI methods are now used to detect events in a football game, and advanced
computational techniques can help achieve higher detection accuracy. In this project we propose
a novel architecture for classifying events in a football (soccer) game. The dataset consists of
numerous images of different events in a standard football match. The objective is to detect
specific, semantically meaningful events such as a pass, kick or shot. The proposed model
consists of a convolutional autoencoder coupled with an image classifier network that classifies
the given images into the defined events. The convolutional autoencoder is placed before the
image classifier in the pipeline because it compresses the images and reduces their
dimensionality, which keeps the implementation computationally fast.
2. Introduction
Football is widely regarded as one of the most popular sports in the world, and its popularity
has attracted many spectators. Numerous studies are being carried out to grow and assist the
sport and to meet the needs of football franchises, media and stakeholders. This research mainly
focuses on estimating team tactics, player analytics, analysis of the sports environment (field
conditions, weather conditions, stadium records, etc.) and the detection of the many events that
occur in a match. Data-driven decisions play a significant role in soccer and many other sports.
Collecting and properly handling quality data from a soccer match is, therefore, of immense
value.
Machine Learning can assist this research by producing more accurate and better optimized
results. Deep Learning, an important facet of AI and ML, can be employed to develop new and
advanced techniques for it.
2.1 Applications
The data typically collected from a football game includes: goals scored, assists, tackles, number
of shots on target, possession information, corners, off sides, fouls, cards given, injuries, player
records, substitutions, etc. There are several applications for event driven systems based on
image data.
Detecting events in football games also helps with computing player and match statistics,
football club development analysis, sports media presentations and many other useful metrics
that can be used inside and outside of the sport.
The number of free kicks, fouls, tackles, etc. in a football game can be counted manually, but
manual counting is costly, time consuming and prone to errors. With intelligent systems based on
event detection, these statistics can be calculated automatically within minutes.
Another application of event detection is the summarization of a football match. The summary
of a match consists of the important events that took place in it, so preparing a useful summary
requires the events to be identified correctly. Using an advanced method for identifying the
events can improve the quality of the summarization task.
3. Literature review
Before the advent of Deep Learning methods, sports analysis, and football video analysis in
particular, was classified into two categories: object tracking and pattern recognition [2].
Object tracking relies on customized cameras [1], which adds computational cost, whereas the
pattern recognition methodology simply extracts lower-level features and then uses a classifier
to detect higher-level events.
The approach proposed in [3] categorizes events into distinct categories such as shoot and
goal. It combines feature extraction with heuristic rules for detecting events. The authors
perform low-level analysis to detect marks (field, lines, logo, arcs and goalmouth), player
positions, ball position, etc., and then derive mid-level features from these cues. Finally, they
develop a rule-based system to detect salient events such as goals and corners.
Deep Learning methods such as Convolutional Neural Networks (CNNs) and Restricted Boltzmann
Machines have been successfully used for event detection. CNNs have shown strong performance
in image classification, object detection and modeling high-level visual semantics.
The authors in [4] present a method for detecting soccer events using a Bayesian network. Such
basic methods suffered from low accuracy until methods based on DL were proposed. Using a
convolutional network and an LSTM network, [5] present a method for summarizing a soccer match
based on event detection, in which five events are considered: corner kicks, free kicks, goal
scenes, centerline and throw-in. This study uses the 3D-ResNet34 architecture in the
convolutional network structure. One of the problems with this work is that no highlights are
taken into account.
Another work [13] presents a deep learning approach for identifying major events in soccer
images. It introduces the concept of using an Autoencoder as a precursor to image
classification. The Autoencoder compresses the image and eases the computational constraints
caused by large image dimensionality.
However, the Variational Autoencoder (VAE) used in their model generally has the disadvantage
of producing blurry output due to the injected noise and imperfect reconstruction.
This motivated us to use a different architecture for the Autoencoder stage. We built a
Convolutional Autoencoder for the image encoding process, as the convolution operator filters an
input signal in order to extract some part of its content. Autoencoders in their traditional
formulation do not take into account the fact that an image can be seen as a combination of
other images. Convolutional Autoencoders, instead, use the convolution operator to exploit this
observation: they learn to encode the input as a set of simple images and then try to
reconstruct the input from them. Convolutional Autoencoders can retain spatial and temporal
information.
3.2 Autoencoder
The traditional Autoencoder (AE) framework consists of three layers, one for inputs, one for
latent variables, and one for outputs. The clear definition of this framework first appeared in [6].
This type of network consists of the following parts:
Encoder:- The part of the network that compresses the input into a space of latent variables. This
can be represented by the encoding function
h=f(x)
Code Layer:- This part of the network contains the reduced representation of the input that is fed
into the decoder.
Decoder:- The part that tries to reconstruct the input from the code produced by the
encoder. It is represented by the decoding function
r=g(h)
where h is the latent variable and r is the data reconstructed from the latent space, which
belongs to the same space as the input x. Training is done by minimizing the reconstruction
error between r and x using back-propagation. This framework was initially proposed to achieve
dimensionality reduction.
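The encode, decode and reconstruction-error steps described above can be sketched in a few lines
of NumPy. This is only an illustrative toy, not the model used in this project: it assumes
linear maps for f and g, a mean squared reconstruction error, and plain gradient descent. The
names train_autoencoder, w_enc and w_dec are hypothetical.

```python
import numpy as np

def train_autoencoder(x, code_size, epochs=500, lr=0.1, seed=0):
    """Fit h = f(x) = x @ w_enc and r = g(h) = h @ w_dec by minimizing the MSE between r and x."""
    rng = np.random.default_rng(seed)
    n_features = x.shape[1]
    w_enc = rng.normal(scale=0.1, size=(n_features, code_size))  # encoder weights (assumed linear)
    w_dec = rng.normal(scale=0.1, size=(code_size, n_features))  # decoder weights (assumed linear)
    losses = []
    for _ in range(epochs):
        h = x @ w_enc                      # code layer: reduced representation of the input
        r = h @ w_dec                      # reconstruction in the same space as x
        err = r - x
        losses.append(float(np.mean(err ** 2)))
        # Back-propagate the reconstruction error through decoder and encoder.
        grad_r = 2.0 * err / err.size
        grad_dec = h.T @ grad_r
        grad_enc = x.T @ (grad_r @ w_dec.T)
        w_dec -= lr * grad_dec
        w_enc -= lr * grad_enc
    return w_enc, w_dec, losses

# Toy data: 200 samples with 8 features that actually lie on a 2-D subspace.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2)) @ (0.5 * rng.normal(size=(2, 8)))
w_enc, w_dec, losses = train_autoencoder(x, code_size=2)
print(f"reconstruction MSE: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Because the code layer has fewer units than the input, the network is forced to learn a compressed
representation, which is the dimensionality reduction mentioned above.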
With the same purpose, [7] proposed a deep autoencoder architecture, where the encoder and the
decoder are multi-layer deep networks.
Convolutional Neural Networks (CNN, or ConvNet) are some of the best known neural networks
for modeling image data. They are very efficient at retaining the connected information between
the pixels of an image, and the particular design of their layers makes them a better choice for
processing image data.
When a CNN is used for image noise reduction or coloring, it is applied in an Autoencoder
framework, i.e., the CNN is used in the encoding and decoding parts of an Autoencoder. Each
input sample is a noisy image, and each output sample is the corresponding image without noise.
The trained model can then be applied to a noisy image to output a clean image. This is
especially useful when we want to denoise image data to obtain better outputs.
Instead of stacking the data, Convolutional Autoencoders keep the spatial information of the
input image as it is and extract information gradually using convolution layers. This process,
which is designed to retain the spatial relationships in the data, is the encoding process of the
Autoencoder. In the middle, there is a fully connected autoencoder with hidden layers. After
that comes the decoding process that reconstructs the image. The encoder and the decoder are
symmetric.
A few of these image classification models were reviewed and their applications were studied in
order to integrate the most suitable model into our architecture.
Extreme learning machine (ELM) is a training algorithm for the single hidden layer feedforward
neural network (SLFN) that converges much faster than traditional methods and yields promising
performance [8].
Unlike traditional feedforward network learning algorithms such as back-propagation (BP), the
ELM does not use a gradient-based technique and, in contrast to an ANN, does not require
iterative optimization of the classification parameters. The input weights and biases of the
hidden neurons are assigned randomly, and the output weights are computed in a single step from
the input feature set and class labels. Once generated, the input and output weights are never
tuned.
The ELM model consists of an input layer, a single hidden layer and an output layer. Its
structure is shown in Fig 1, with k input layer nodes, j hidden layer nodes, i output layer
nodes, and the hidden layer activation function g(x).
Some of the types of ELMs discussed in [9] include:
X_i ∈ R^N × R^j, y_i ∈ R^N × R^m (i = 1, 2, ..., N)
h = g(ax + b)
Fig 1. The model structure of ELM.
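A minimal NumPy sketch of ELM training is given below to make the procedure concrete. It assumes
the standard formulation, in which the hidden layer output h = g(ax + b) uses randomly drawn a
and b, and the output weights are obtained in one step with the Moore-Penrose pseudoinverse
[12]. The function names, the sigmoid choice for g, the hidden layer size and the toy data are
all illustrative assumptions, not the configuration used in this project.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(x, y_onehot, n_hidden, seed=0):
    """Train an ELM: random hidden layer, output weights solved by least squares."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(x.shape[1], n_hidden))  # random input weights, never tuned
    b = rng.normal(size=n_hidden)                # random hidden biases, never tuned
    h = sigmoid(x @ a + b)                       # hidden layer output h = g(ax + b)
    beta = np.linalg.pinv(h) @ y_onehot          # output weights via the pseudoinverse
    return a, b, beta

def elm_predict(x, a, b, beta):
    """Return the index of the class with the largest output score."""
    return np.argmax(sigmoid(x @ a + b) @ beta, axis=1)

# Toy two-class problem (hypothetical data, not the soccer dataset).
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-2.0, 0.5, size=(100, 2)),
               rng.normal(2.0, 0.5, size=(100, 2))])
labels = np.repeat([0, 1], 100)
y_onehot = np.eye(2)[labels]

a, b, beta = elm_fit(x, y_onehot, n_hidden=20)
accuracy = float(np.mean(elm_predict(x, a, b, beta) == labels))
print(f"training accuracy: {accuracy:.2f}")
```

Since there is no iterative loop, training amounts to a single matrix pseudoinverse, which is why
ELM training is fast compared with back-propagation.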
An ANN [10] is a network of connected input and output units in which a weight is associated
with each connection. It consists of one input layer, one or more intermediate layers and one
output layer. A neural network learns by adjusting the weights and biases of its connections;
updating the weights iteratively improves the performance of the network.
The behavior of a neural network is affected by its learning rule, architecture and transfer
function. Each neuron is activated by the weighted sum of its inputs, and the activation signal
is passed through a transfer function to produce the single output of the neuron. This transfer
function introduces the non-linearity of the network. During training, the inter-connection
weights are optimized until the network reaches the specified level of accuracy.
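As a small illustration of the ideas above, the sketch below trains a single neuron: it forms the
weighted sum of the inputs, passes it through a sigmoid transfer function, and updates the
weights iteratively until a target error is reached. The learning rate, the error threshold, the
squared-error criterion and the logical AND toy task are assumptions chosen only for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_neuron(x, y, lr=2.0, target_mse=0.01, max_iters=20000):
    """Gradient descent on the squared error of one sigmoid neuron."""
    w = np.zeros(x.shape[1])
    b = 0.0
    mse = float("inf")
    for _ in range(max_iters):
        out = sigmoid(x @ w + b)            # weighted sum passed through the transfer function
        err = out - y
        mse = float(np.mean(err ** 2))
        if mse <= target_mse:               # stop once the specified error level is reached
            break
        delta = err * out * (1.0 - out)     # error signal scaled by the sigmoid derivative
        w -= lr * (x.T @ delta) / len(y)    # iterative weight update
        b -= lr * float(np.mean(delta))     # iterative bias update
    return w, b, mse

# Toy task: learn the logical AND of two binary inputs.
x_and = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_and = np.array([0.0, 0.0, 0.0, 1.0])
w, b, mse = train_neuron(x_and, y_and)
predictions = np.round(sigmoid(x_and @ w + b))
print(predictions, f"final MSE: {mse:.4f}")
```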
The Convolutional Neural Network (CNN) [14] is one of the most popular neural network
designs used for image classification, image noise reduction and coloring. A CNN model is
trained by taking many image samples as the inputs and their labels as the outputs. The trained
model is then applied to a new image to predict its label. A CNN involves several intermediate
steps, such as pooling, which reduces the dimensionality of the feature maps, and padding of
the image, which controls the size of the output.
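The effect of padding and pooling on the size of a feature map can be seen in the short NumPy
sketch below. The helper names zero_pad and max_pool_2x2 are hypothetical, and a single-channel
4x4 array stands in for a feature map.

```python
import numpy as np

def zero_pad(image, pad):
    """Surround a 2-D array with `pad` rows and columns of zeros on every side."""
    return np.pad(image, ((pad, pad), (pad, pad)), mode="constant")

def max_pool_2x2(image):
    """Keep the maximum of every non-overlapping 2x2 block, halving height and width."""
    h, w = image.shape
    trimmed = image[: h - h % 2, : w - w % 2]   # drop an odd last row or column, if any
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.arange(16).reshape(4, 4)
padded = zero_pad(feature_map, 1)    # 4x4 becomes 6x6
pooled = max_pool_2x2(feature_map)   # 4x4 becomes 2x2
print(padded.shape, pooled.shape)
```

Padding enlarges the input so that a convolution does not shrink it, while pooling is the step
that actually reduces the spatial dimensions.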
The Soccer Event Dataset [13] used consists of numerous images pertaining to a football (soccer)
game. The images include various events that generally occur in a standard football match. From
these events we have taken 4 commonly seen occurrences as our image labels. These events are
namely:
● Free Kick
● Tackle
● Penalty
● Corner Kick
In the present project, two image datasets for a football match were collected:
Then, the literature survey as well as relevant coding knowledge required to implement the
project was reviewed and refreshed.
The dataset collected for the project consists of various events in a football match and is
categorized into 4 classes namely tackle, free kick, penalty and corner.
Each image and its corresponding label were stored in two separate lists, namely data and
target.
All the images were then resized to 224x224 and normalized, and saved in the form of numpy
arrays.
● The train set consists of a total of 8000 images with the dimension of each image being
224x224x3 (corresponds to the RGB layers of the image)
● The test set consists of a total of 2000 images also with dimensions of 224x224x3
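The preprocessing steps described above can be sketched as follows. This is a hedged example
rather than the project code: it assumes images are already loaded as H x W x 3 uint8 arrays,
uses a simple nearest-neighbour resize as a stand-in for a library resize routine, assumes
normalization means scaling pixel values to [0, 1], and encodes the four classes as integers
in an arbitrary order. All function and file names are hypothetical.

```python
import numpy as np

CLASS_NAMES = ["tackle", "free kick", "penalty", "corner"]  # integer order is an assumption

def resize_nearest(image, size=(224, 224)):
    """Nearest-neighbour resize of an H x W x 3 array (stand-in for a library resize)."""
    h, w = image.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return image[rows][:, cols]

def preprocess(images, labels, size=(224, 224)):
    """Build the `data` and `target` lists, then convert them to NumPy arrays."""
    data, target = [], []
    for image, label in zip(images, labels):
        resized = resize_nearest(image, size)
        data.append(resized.astype(np.float32) / 255.0)  # assumed normalization to [0, 1]
        target.append(CLASS_NAMES.index(label))
    return np.stack(data), np.array(target)

def save_arrays(prefix, data, target):
    """Save both arrays to disk, e.g. train_data.npy and train_target.npy."""
    np.save(f"{prefix}_data.npy", data)
    np.save(f"{prefix}_target.npy", target)

# Two synthetic "images" of different sizes stand in for real frames.
rng = np.random.default_rng(0)
raw_images = [rng.integers(0, 256, size=(360, 640, 3), dtype=np.uint8),
              rng.integers(0, 256, size=(300, 300, 3), dtype=np.uint8)]
data, target = preprocess(raw_images, ["tackle", "corner"])
print(data.shape, target)
```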
The first stage of the code implementation involved creating the Convolutional Autoencoder.
The coding was done in Python using the TensorFlow library as the backend for deep learning
applications. In this stage, the concepts related to autoencoders were thoroughly researched and
understood before beginning the implementation.
Decoder
● The decoder consists of a Conv2D_transpose layer where the weights are transposed
and flipped by 180° followed by an upsampling layer.
● This layer is followed by another Conv2D_transpose layer and an upsampling layer.
● The output layer consists of 3 kernels of 3x3 dimensions with sigmoid as the activation
function.
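The remark about flipping the weights by 180° can be checked numerically. The NumPy sketch below
is an illustration under simplifying assumptions (a single channel, a square kernel, stride 1
and no padding) and is not the TensorFlow layer itself. It shows that a transposed convolution
gives the same result as sliding the 180°-rotated kernel over a zero-padded input, and that
2x upsampling doubles the height and width. All function names are hypothetical.

```python
import numpy as np

def conv_transpose2d(x, w):
    """Stride-1 transposed convolution: every input pixel scatters a scaled copy of the kernel."""
    k = w.shape[0]
    out = np.zeros((x.shape[0] + k - 1, x.shape[1] + k - 1))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i:i + k, j:j + k] += x[i, j] * w
    return out

def correlate_flipped(x, w):
    """Sliding-window pass over the zero-padded input with the kernel rotated by 180 degrees."""
    k = w.shape[0]
    padded = np.pad(x, k - 1)      # zero padding of k - 1 on every side
    flipped = w[::-1, ::-1]        # kernel flipped by 180 degrees
    out = np.zeros((x.shape[0] + k - 1, x.shape[1] + k - 1))
    for p in range(out.shape[0]):
        for q in range(out.shape[1]):
            out[p, q] = np.sum(padded[p:p + k, q:q + k] * flipped)
    return out

def upsample2x(x):
    """Nearest-neighbour upsampling: repeat every row and column twice."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4, 4))
kernel = rng.normal(size=(3, 3))

same = np.allclose(conv_transpose2d(feature_map, kernel),
                   correlate_flipped(feature_map, kernel))
print("transposed conv equals flipped-kernel pass:", same)
print("upsampled shape:", upsample2x(feature_map).shape)
```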
The model was compiled with ‘Mean square error’ as the loss function, ‘Adam’ as the optimizer
and ‘Accuracy’ as the evaluation metric. A portion of the test data was also used for validation
during training. The model was then trained over 50 epochs with a batch size of 32.
6.5 Building the Image Classification Network
Proposed Classifier 1
The model employs sigmoid as the activation function and was evaluated using mean square
error.
The figure below shows the specifications used to initialize the ELM model.
Proposed Classifier 2
1. Autoencoder
[Plot panels: Loss, Accuracy]
Fig 7. Plots showing the training and validation loss and accuracy with increasing epochs for
Autoencoder
Fig 9. Plots showing the training and validation loss and accuracy with increasing epochs for
custom VGG16 model
Fig 10. Classification results of the VGG 16 model for sample images
8. Next Steps
It can be noted that while the ELM model offers lower accuracy, its training time is extremely
short. This is in contrast to the CNN classifier, which offers relatively higher accuracy but
takes longer to train.
Thus the next steps include optimizing both of the above-mentioned architectures to minimize
the tradeoff between accuracy and training time. The models can be optimized by testing
different activation functions, feeding more data or tuning the model hyperparameters. Based on
the results obtained, the better performing architecture will be finalized and put through one
more round of fine tuning to achieve the expected results.
The final model should be able to encode the large image set efficiently and accurately
distinguish between the 4 events described in our report whilst also having lower computational
time.
9. References:
[1] Jia Liu, Xiaofeng Tong, Wenlong Li, Tao Wang, Yimin Zhang, and Hongqi Wang. Automatic
player detection, labeling and tracking in broadcast soccer video. Pattern Recognition Letters,
30(2):103-113, 2009.
[2] Chung-Lin Huang, Huang-Chia Shih, and Chung-Yuan Chao. Semantic analysis of soccer
video using dynamic bayesian network. IEEE Transactions on Multimedia, 8(4):749-760, 2006.
[3] Xueming Qian, Guizhong Liu, Huan Wang, Zhi Li, and Zhe Wang. Soccer video event
detection by fusing middle level visual semantics of an event clip. In Pacific-Rim Conference on
Multimedia, pages 439-451. Springer, 2010.
[4] M. Tavassolipour, M. Karimian, and S. Kasaei, “Event detection and summarization in soccer
videos using bayesian network and copula,” IEEE Transactions on circuits and systems for video
technology, vol. 24, no. 2, pp. 291–304, 2013.
[5] R. Agyeman, R. Muhammad, and G. S. Choi, “Soccer video summarization using deep
learning,” in 2019 IEEE Conference on Multimedia Information Processing and Retrieval
(MIPR). IEEE, 2019, pp. 270–273.
[6] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning
from examples without local minima. Neural Networks, 2(1):53–58, January 1989.
[7] Geoffrey Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with
neural networks. Science, 313(5786):504–507, 2006.
[8] Wang, J., Lu, S., Wang, SH. et al. A review on extreme learning machine. Multimed Tools
Appl (2021). https://doi.org/10.1007/s11042-021-11007-7
[9] Ding, Shifei & Zhang, Nan & Xu, Xinzheng & Guo, Lili & Zhang, Jian. (2015). Deep Extreme
Learning Machine and Its Application in EEG Classification. Mathematical Problems in
Engineering. 2015. 1-11. 10.1155/2015/129021.
[10] A. K. Jain, J. Mao, and K. M. Mohiuddin, “Artificial neural networks: A tutorial,” Computer,
vol. 29, no. 3, pp. 31–44, 1996.
[11] I. Tabian, H. Fu, and Z. S. Khodaei, “A Convolutional Neural Network for Impact Detection
and Characterization of Complex Composite Structures,” Sensors, vol. 19, no. 22, p. 4933, Nov.
2019.
[12] Ben-Israel, Adi. (2002). The Moore of the Moore-Penrose Inverse. ELA. The Electronic
Journal of Linear Algebra [electronic only]. 9. 10.13001/1081-3810.1083.
[13] Karimi, Ali & Toosi, Ramin & Akhaee, Mohammad. (2021). Soccer Event Detection Using
Deep Learning.
[14] Krizhevsky, Alex & Sutskever, Ilya & Hinton, Geoffrey. (2012). ImageNet Classification with
Deep Convolutional Neural Networks. Neural Information Processing Systems. 25.
10.1145/3065386.