Automatic Tunisian dialect detection
Final End-of-Studies Internship Report (PFE)
I would like to express my deepest gratitude to each and every one who supported me and
kept me going. I dedicate this work to those who have advised me, criticized me, believed in
me and supervised me. I give my thanks through these short lines to:
• My dear late father, I dedicate this to you. I know how much you have been waiting to
witness the day I graduate. I know you would be so proud of me. I miss you with
every breath I take.
• My dear mother, your kindness and your endless love made me what I am today. May
God protect you.
• My supervisor Mr. Hatem HADDAD, Co-founder and CTO of the Company. The one
who welcomed me with open arms, guided me through this project and helped me a
lot with his patience and constructive recommendations.
• All my family and friends who supported me through my darkest times. You were, and
still are, my safe haven and my shelter.
Finally, I extend my deepest gratitude to all my teachers for the training they have provided
me and all those who have contributed directly or indirectly to the smooth running of this
project.
Table of contents
Acknowledgement
Table of contents
List of figures
List of tables
List of abbreviations
General Introduction
Chapter 1: Project concept and objectives
    Introduction
    1. Host company presentation
        1.1. Company overview
        1.2. Research and Development
        1.3. Services
        1.4. Market Structure
    2. Problem statement
        2.1. Problem description
        2.2. Objectives
        2.3. Need analysis
        2.4. The importance of an AI solution
        2.5. Work methodology
    Conclusion
Chapter 2: Theoretical Background
    Introduction
    1. Machine Learning Algorithms
        1.1. Machine learning presentation
        1.2. Machine Learning methods
        1.3. Application of machine learning algorithms
        1.4. Some machine learning classification algorithms
    2. Deep Learning
        2.1. The perceptron
        2.2. Multilayer perceptron (MLP)
        2.3. Activation functions
        2.4. Loss functions
        2.5. Neural network training process
    3. Recurrent neural networks
        3.1. Long Short-Term Memories
        3.2. Gated Recurrent Units
    4. Transfer learning
        4.1. Transfer learning in NLP
        4.2. Attention mechanisms
        4.3. The Transformers architecture
    5. BERT
        5.1. Pretraining BERT
        5.2. Fine-Tuning BERT
        5.3. BERT for the Arabic Language
    Conclusion
Chapter 3: Implementation and Experimental Results
    Introduction
    1. Work Environment
        1.1. Hardware Environment
        1.2. Software Environment
    2. Dataset
        2.1. NADI Shared Task Dataset
        2.2. Dataset Analysis
        2.3. Preprocessing the data
    3. Implementation
        3.1. Machine Learning Models
        3.2. Transformers
    4. Comparative analysis
    5. Perspective
    Conclusion
General Conclusion
Appendix
List of figures
List of tables
List of abbreviations
SEP: Separator
SMS: Short Message Service
SPSS: Statistical Package for the Social Sciences
Tanh: Hyperbolic Tangent
TF-IDF: Term Frequency-Inverse Document Frequency
TPU: Tensor Processing Unit
UN: United Nations
UNESCO: United Nations Educational, Scientific and Cultural Organization
General Introduction
Natural language processing's objective has always been to build machines able to simulate
the human ability to process and understand language, with the final aim of creating
automated solutions that can understand and interact with humans with a high level of
accuracy. The goal is to make computers able to identify, understand and generate language
without any assistance. The research done so far has achieved outstanding results for Latin-
based languages, especially English. However, it has neglected other languages such as Arabic.
The term Arabic language can be thought of as an umbrella term under which it is possible to
identify hundreds of varieties of the language. Despite this diversity, those varieties were
strictly restricted to speech, leaving Modern Standard Arabic (MSA) dominating the written
forms of communication. However, with the huge advancement of social media, an explosion
of written content in these neglected varieties took the internet by storm, attracting the
attention and interest of the NLP research community in the process.
One of those dialects is our beloved Tunisian dialect. It is not only used by Tunisians for their
everyday communication; it is also a literary language of proverbs, rhymes, tales, riddles and
poems, and a language for writing songs and plays. Today it is widely used on radio,
television and in advertising. It has become a necessity to find a way to automatically process
the Tunisian dialect in particular and all Arabic dialects in general.
The report is divided into 3 chapters. In chapter 1, we present the host company and overview
the need analysis and the importance of an AI solution. This is followed by chapter 2, in which
we dive into the theory behind machine learning and deep learning. Finally, in chapter 3, we
go through the data and the work procedure, and present the results with a discussion of the
most significant models we tested in our experiments.
Chapter 1: Project concept and objectives

Introduction
In this chapter, we present a general overview of the company that hosted this project. We
highlight the company's field of business, its services, and its position in the market. We also
go through a need analysis for the project and its importance for the community.
1.3. Services
The team is composed of AI engineers, R&D researchers, and linguistic experts. They use
the latest technologies to provide state-of-the-art services such as digital reputation analysis,
sentiment analysis projects, chatbots and other consulting services. One of the most successful
products is "3ziza" [1], an AI-based chatbot. During the covid-19 pandemic, it was vital for
the Ministry of Health to deliver recent updates about the current situation. There was a huge
number of calls on a daily basis, with many repeated questions. To reduce the number of calls
and help spread the information, this chatbot was born and deployed on the official website
covid-19.tn provided by the Ministry of Health.
The chatbot was taught to speak and understand French, Arabic and the Tunisian dialect. The
training was done on the "TUNIZI" dataset, which is written in Latin characters mixed with
numbers, just like how Tunisians interact on social media. This makes the chatbot more user
friendly. It was able to respond to more than 10,000 questions a day with a retention of 7.4
questions per user. "3ziza" has without a doubt gained the trust of the Tunisian citizens, who
kept asking and interacting with it to get answers.
The company has supporters and partners everywhere. In Tunisia, we can name the Tunisian
Presidency of the Government, the Ministry of Interior, the Ministry of Health, and startups
like "Enova Robotics". There are foreign partners as well, such as "Starfolk Software
Solutions" in Nigeria.
These partnerships gave birth to joint products and services. For example, the collaboration
with Enova Robotics helped create "Jasmin", shown in figure 1.2. Jasmin is a robot deployed
during the covid-19 pandemic at Sahloul hospital in Sousse. Its role is to reduce direct contact
between the medical staff and the patients to the minimum possible.
2. Problem statement
2.1. Problem description
In the field of automatic processing of the Arabic language, most research and achievements
have focused only on Modern Standard Arabic, without giving dialectal Arabic the attention
it needs. Only in the last 10 years have these dialects begun to arouse increasing interest
within the NLP community, especially given their increased use on social media and the
social web.
2.2. Objectives
In this work, we focus on the Tunisian dialect and propose to advance the state of the art in
the automatic processing of this dialect by presenting a model able to detect it whenever it is
used. This model is based on the latest AI technologies and can be deployed on the web.
companies must launch campaigns to introduce their products or services, then pay polling
companies to gather the public's opinions, analyze them and report how well the campaign
was received. Now, a simple AI program can do this task in a short period, saving both time
and money.
• It is critical to have a precise interpretation of people's opinions. Polling companies
cannot ensure that, as they investigate opinions through close-ended questions. The
results can be biased, since the answer must be one of the choices included in the poll.
Respondents must be free to express their opinions the way they want; this yields more
information and helps in understanding them better. On social media, people comment
with no limitation, so the company receives full feedback and communicates better with
the audience through an AI program that automates these tasks.
This qualitative study shows that AI is a cost-effective tool to measure the perception of the
audience, interact with it and gain a better understanding of the community.
All participants in the questionnaire were asked "How important is it to use AI in the Tunisian
administration's communication?" and "Why do you think AI is important?". Their answers
are illustrated in figures 1.7 and 1.8.
• The majority think that AI is important to facilitate communication between the
Tunisian administrations and the citizens, to make accessing information easier and to
save both money and time.
2.5. Work methodology
The work followed these steps:
• Bibliographic study to sort out the importance of this work, the problems encountered
by other researchers and the results of their efforts in the field.
• Gather data related to the topic and find an appropriate algorithm to clean it.
• Reproduce the previous work mentioned in the bibliographic study and analyze the results.
• Identify the problems and find a way to deal with them.
• Adopt the appropriate strategies to create a state-of-the-art model.
Conclusion
In this chapter, we have presented the host company, its services and market structure. We
also overviewed the importance of this project for our community and explored why AI is
needed. The next chapter is an in-depth theoretical explanation of the machine learning and
NLP fundamentals on which this project was built.
Chapter 2: Theoretical Background
Introduction
In this chapter, we explain the theory behind machine learning and deep learning and go
through some of the math behind them. Then, we dive into transfer learning and transformers,
which are known for their state-of-the-art results.
As shown in figure 2.1, in supervised learning the user must label each and every example of
the dataset. Then, the data is divided into two samples. The first sample is the larger one; it is
passed through a machine learning algorithm to adjust the algorithm's parameters and
optimize its performance. The second serves as a test to check the accuracy of the model on
unseen data.
• Clustering: usually used to group similar data together. For example, grouping pictures
of birds by species.
• Anomaly detection: unsupervised learning can be used to flag outliers in a dataset. For
example, detecting fraudulent transactions in the banking business.
• Association: by looking at a few key attributes in the data, unsupervised learning can
predict other attributes that go along with them. For example: recommendation systems.
• Autoencoders: autoencoders take input data, compress it into a code, then try to recreate
the input data from that summarized code. For example, by using both noisy and clean
versions of images in training, unsupervised learning can remove the noise from other
pictures (a minimal sketch follows this list).
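To make the autoencoder idea concrete, the following is a minimal PyTorch sketch, assuming a generic 784-dimensional input such as a flattened image; the layer sizes are illustrative, not taken from any particular system.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Compress the input into a small code, then try to reconstruct it."""
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(16, 784)             # a batch of 16 inputs
loss = nn.MSELoss()(model(x), x)    # reconstruction error drives training
loss.backward()
```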
Reinforcement learning is useful in training robots with the ability to make a series of
decisions. For example: autonomous vehicles and managing inventories in warehouses.
1.3.1. Regression
This powerful statistical method is used in finance and investing to predict a value from an
independent variable or a series of variables [6]. The most used family is linear regression, of
which we can distinguish two types:
The simplest form, simple linear regression, is given by equation 2.1:

$$y = a + bx \qquad (2.1)$$

where:
$y$: the variable that you are trying to predict (dependent variable),
$a$: the intercept,
$b$: the slope.
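As a quick illustration, a simple linear regression can be fitted in a few lines with scikit-learn; the data below is made up so that $y = 1 + 2x$ exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])   # independent variable
y = np.array([3, 5, 7, 9])           # dependent variable (y = 1 + 2x)

reg = LinearRegression().fit(X, y)
print(reg.intercept_, reg.coef_[0])  # a (intercept) = 1.0, b (slope) = 2.0
```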
1.3.2. Classification
Classification is a predictive modeling problem in which a class label is predicted for a given
example of input data [7]. We can encounter four main types of classification problems:
• Binary classification: tasks having only two class labels.
• Multi-class classification: tasks having more than two class labels.
• Multi-label classification: tasks that have two or more class labels, where one or more
class labels may be predicted for each example. For example, a model can predict the
presence of multiple classes in one photo (like detecting "tree", "bike" and "person" in
the picture).
• Imbalanced classification: tasks where the number of examples in each class is unequally
distributed, as in fraud detection and medical diagnostic tests.
(2.3)
with:
$w$: the weight.
$$P(C_k \mid x) = \frac{P(x \mid C_k)\,P(C_k)}{P(x)} \qquad (2.4)$$

with:
$P(C_k \mid x)$: posterior probability, $P(C_k)$: class prior probability, $P(x \mid C_k)$: likelihood and
$P(x)$: predictor prior probability.
The classification works by deriving the maximum a posteriori class, that is, the class $C_k$
that maximizes $P(C_k \mid x)$ with the above independence assumption applied to Bayes'
theorem. This reduces the computational cost to simply counting the class distribution. Even
though the assumption is not valid in most cases, since the attributes are usually dependent,
Naive Bayes can still perform impressively.
There are many Naïve Bayes classifiers, depending on the assumption we make about the
data. If we assume that the data follows a normal (also called Gaussian) distribution, we are
talking about a Gaussian Naïve Bayes classifier (also noted Gaussian NB classifier). The
probability density of a value $x$ given a class $C_k$ can be computed with equation 2.5.
$$p(x \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\!\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right) \qquad (2.5)$$

with:
$\mu_k$: mean of the values in $x$ associated with class $C_k$,
$\sigma_k^2$: the Bessel-corrected variance of the values in $x$ associated with class $C_k$,
$p(x \mid C_k)$: the probability density of $x$ given class $C_k$.
In the case where the features are independent Booleans describing the input, we can talk
about a multivariate Bernoulli event model. This model is popular for document
classification. If $x_i$ is a Boolean expressing the occurrence or absence of the $i$-th term from
the vocabulary, then the likelihood of a document given a class $C_k$ is given by equation 2.6:

$$P(x \mid C_k) = \prod_{i=1}^{n} p_{ki}^{x_i}\,(1-p_{ki})^{1-x_i} \qquad (2.6)$$

with:
$P(x \mid C_k)$: the likelihood of the document given class $C_k$,
$p_{ki}$: the probability of class $C_k$ generating the term $x_i$.
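Since a Bernoulli Naive Bayes classifier is used later in this project, here is a minimal scikit-learn sketch on toy Boolean term-occurrence features; the data is invented purely for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Each row: occurrence (1) or absence (0) of 4 vocabulary terms in a document
X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])
y = np.array([0, 0, 1, 1])   # two document classes

clf = BernoulliNB().fit(X, y)
print(clf.predict([[1, 0, 0, 0]]))        # most likely class
print(clf.predict_proba([[1, 0, 0, 0]]))  # posterior probabilities
```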
2. Deep Learning
We can say that deep learning is a subset of machine learning. It uses artificial neural
networks that mimic the way the human brain processes information and makes decisions.
2.1. The perceptron

The perceptron computes a weighted sum of its inputs and applies an activation function $f$, as
in equation 2.7:

$$y = f\!\left(\sum_{i=1}^{n} w_i x_i + b\right) \qquad (2.7)$$

with:
$x_i$: the value of the $i$-th input,
$w_i$: the weight of the $i$-th input,
$b$: bias.
The bias is an extra input and an external parameter of the neuron that shifts the value of the
activation function left or right. Its existence is sometimes critical to the success of the
learning process.
Figure 2.6 summarizes the architecture of the perceptron.
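A minimal NumPy sketch of equation 2.7 follows; the step activation and the numbers are chosen only for illustration.

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum of the inputs plus bias, passed through a step activation."""
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.4, 0.3, 0.9])    # weights
print(perceptron(x, w, b=-1.0))  # the bias shifts the firing threshold
```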
2.3. Activation functions

• Sigmoid: given in equation 2.8, the sigmoid is one of the first functions to have been used
in neural networks. It maps the resulting values into a range between zero and one. For
that reason, it can be interpreted as a firing rate for the neuron, with values close to zero
meaning not firing and values close to one meaning firing.

$$\sigma(x) = \frac{1}{1+e^{-x}} \qquad (2.8)$$

• Rectified Linear Unit (ReLU): given in equation 2.9, ReLU is the most used activation
function in neural networks.

$$f(x) = \max(0, x) \qquad (2.9)$$

• Hyperbolic Tangent: given in equation 2.10, the hyperbolic tangent function is mostly
used for classification between two classes. The resulting values are mapped between
-1 and 1, which makes optimization easier than with the sigmoid function.

$$\tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} \qquad (2.10)$$

• Softmax: given in equation 2.11, the softmax function is used for multiclass
classification. It calculates the probability of belonging to each possible class, and based
on these probabilities we can determine the target class for a given input (all four
functions are sketched in code after this list).

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \qquad (2.11)$$
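For reference, the four functions above can be written directly in NumPy; this is a small sketch, with the usual max-shift added to softmax for numerical stability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def tanh(x):
    return np.tanh(x)

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # probabilities summing to 1
```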
2.4. Loss functions

• Mean Squared Error: also noted as L2-loss, it measures the average of the squared
differences between the predictions and the desired outputs. Equation 2.12 is its
mathematical expression.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad (2.12)$$

where:
$y_i$: the desired output and $\hat{y}_i$: the predicted output.
• Mean Absolute Error: also noted as L1-loss, it measures the average magnitude of the
error without any consideration of its direction. Equation 2.13 is its mathematical
expression.

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \qquad (2.13)$$

where:
$y_i$: the desired output and $\hat{y}_i$: the predicted output.
• Cross-Entropy: used for classification tasks, it measures the difference between the
predicted probability distribution and the desired one. Equation 2.14 is its mathematical
expression.

$$\mathrm{CE} = -\sum_{i=1}^{n} y_i \log(\hat{y}_i) \qquad (2.14)$$

where:
$y_i$: the desired output and $\hat{y}_i$: the predicted probability.
3. Recurrent neural networks

In sequential tasks, the state of the current input depends on the previously provided inputs,
so the role of recurrent neural networks is to find the relationship between these inputs.
There are two major problems with the use of RNNs that happen during the training process,
when the gradients are propagated back in time:
• The vanishing gradient problem: as the gradients from deeper layers must go through
many matrix multiplications, if their values are small, they begin to shrink exponentially
until they eventually vanish, making the learning process impossible to carry out.
• The exploding gradient problem: opposite to the vanishing gradient problem, if the values
of the gradients are large, the matrix multiplications make them grow larger and larger
until, at some point, they explode and the model crashes (a common mitigation, gradient
clipping, is sketched after this list).
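The following is a minimal PyTorch sketch of gradient clipping; the model, data and loss are placeholders.

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 20, 8)    # batch of 4 sequences, 20 steps, 8 features
output, _ = model(x)
loss = output.pow(2).mean()  # placeholder loss
loss.backward()

# Rescale the gradients so their global norm never exceeds 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```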
At this point, we need to update the cell state. The previous cell state is multiplied by the
forget vector and then added to the input vector from the input gate. This gives us a new cell
state that is more relevant for the training process.
Finally, we find the output gate. The previous hidden state and the current input are passed
through a sigmoid function, and the new cell state passes through the tanh function. Then, by
multiplying the tanh output with the sigmoid output, it can be decided what information the
output should carry.
3.2. Gated Recurrent Units

The GRU relies on two gates:
• Update gate: it has the same behavior as both the input gate and the forget gate of an
LSTM unit. After a weighted summation of the previous state and the current input, the
result passes into a sigmoid function. According to the result, which is between 0 and 1,
the gate decides what information to keep and what to throw away.
• Reset gate: another gate, used to decide how much past information to drop, by
multiplying the input and the previous state with their corresponding weights, summing
the results and feeding them to the sigmoid function (a short usage sketch of both units
follows this list).
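In practice, both units are available as ready-made PyTorch modules; a minimal sketch, with illustrative dimensions:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 20, 8)   # batch of 4 sequences, 20 time steps, 8 features

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
out, (h_n, c_n) = lstm(x)   # hidden state h_n and separate cell state c_n

gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
out, h_n = gru(x)           # the GRU keeps no separate cell state

print(out.shape)            # torch.Size([4, 20, 16])
```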
4. Transfer learning

Figure 2.10 summarizes the transfer learning process. The first model is usually trained on a
large unlabeled dataset (Wikipedia articles, news articles…). Technically, the labels are
contained within the text itself (as in the context of the sentence); therefore, we cannot call it
supervised learning but self-supervised learning. The model acquires knowledge about the
semantics of the language it is trained on and builds a general understanding of the language
(subject-verb agreement, gender, synonyms…). The second model uses this knowledge and
fine-tunes it on a smaller, usually labeled dataset to achieve a certain task with better
performance and faster training than the usual methods.
What comes out of the CNN is information of the type: "words 9, 12 and 24 are very
important to convey the exact meaning of this sentence; moreover, it will be necessary to
combine or correlate 9 and 12; remember that when decoding the sentence". At each step we
decide to keep a particular word (thus concentrating the information) which is supposed to be
important when generating the next word. That said, it is very positional: much depends on
the position of words in the sentence and their position relative to each other to create context,
more than on their semantic similarity.
5. BERT
BERT [21] stands for Bidirectional Encoder Representations from Transformers. This
architecture can achieve great results in neural machine translation, question answering,
sentiment analysis, text summarization and many more tasks. All these problems require an
understanding of language, so we can use BERT to understand the language and then fine-
tune it depending on the problem we want to solve. As such, the training of BERT is done in
two phases: the first phase is pretraining, where the model learns what language and context
are; the second phase is fine-tuning, where it learns how to solve the downstream problem.
To check whether the second sentence is connected to the first, the following steps are
performed:
• The entire input sequence is fed to the Transformer model.
• The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple
classification layer.
• The probability of IsNextSequence is calculated with softmax (a minimal sketch follows
this list).
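As an illustration with the HuggingFace transformers library, next sentence prediction can be queried as follows; the English bert-base-uncased checkpoint and the two sentences are used purely for demonstration.

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sent_a = "The weather was terrible this morning."
sent_b = "So I decided to stay home."

# The tokenizer builds the sequence: [CLS] sent_a [SEP] sent_b [SEP]
inputs = tokenizer(sent_a, sent_b, return_tensors="pt")
logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
print(probs)   # index 0 = IsNext probability, index 1 = NotNext
```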
5.3. BERT for the Arabic Language

Table 2.1: AraBERT model sizes and training data
Model | Size (MB / params) | Pre-segmentation | Dataset (sentences / size / words)
AraBERTv0.2-large | 1.38G / 371M | No | 200M / 77GB / 8.6B
We also have the models developed by Ali Safaya [25], given in table 2.2.
Table 2.2: Different versions of Ali Safaya's BERT for Arabic language
Model BERT-Mini BERT-Medium BERT-Base BERT-Large
Hidden Layers 4 8 12 24
Attention heads 4 8 12 16
Hidden size 256 512 768 1024
Parameters 11M 48M 110M 340M
Conclusion
In this chapter, we explained what machine learning is and how it works. We dived into deep
learning and the mechanisms that make it function. Then, we went through how transfer
learning in general, and transformers in particular, work. In the next chapter, we will try those
methods in practice, examine the results and deduce the best solution for our task.
Chapter 3: Implementation and Experimental Results
1. Work Environment
1.1. Hardware Environment
1.1.1. GPU vs CPU
For machine learning in general and deep learning in particular, the most important part of
the hardware environment is the graphical processing unit.
A neural network might have around 100, 1,000 or even 10,000 units. A normal CPU can still
manage to handle their calculations in a matter of minutes or a few hours. But what if there
are millions of parameters? It would take days, or probably years if the network is too big,
and your computer would probably give up before finishing the task. The GPU is a better
choice in this case. It offers the possibility to run the computations simultaneously instead of
one after the other, making the training process faster, saving both money and energy, and
freeing the CPU for other tasks. Figure 3.1 shows how a GPU can perform roughly 10 times
more operations per clock cycle than a CPU.
Figure 3.1: Theoretical Peak Floating Point Operations per Clock Cycle [26].
1.2.4. Pandas
Pandas [30] is the Python library most preferred by data scientists for data analysis and data
manipulation. Its fast, expressive, and flexible data structures make real-world data analysis a
significantly easier task. So many functionalities are built into this package that the options
can be overwhelming.
1.2.5. Scikit-learn
Scikit-learn [31] is a free Python machine learning library featuring many algorithms for
preprocessing, regression, classification, and clustering, designed to work along with the
NumPy and SciPy libraries.
1.2.6. PyTorch
PyTorch [32] is an open-source library for machine learning based on the Torch library, a
scientific computing framework and script language based on the Lua programming
language.
It was primarily developed by Facebook's AI Research lab and is used for various
applications like computer vision and natural language processing. This library provides two
great features:
• Strong acceleration via graphics processing units for tensor computing.
• Deep neural networks built on a tape-based automatic differentiation system.
1.2.7. HuggingFace library
Created by Hugging Face [33], an NLP-focused startup, this library provides state-of-the-art
pretrained Transformer architectures such as BERT, ELECTRA, GPT-2 and many more. The
main purpose of this library is to support a variety of NLP tasks such as text classification,
information extraction, question answering, and text generation.
2. Dataset
2.1. NADI Shared task Dataset
NADI [34], for Nuanced Arabic Dialect Identification, is a shared task that includes two
subtasks: country-level identification and province-level identification. The task presents a
dataset with two sets of labels, one per subtask. In other words, the same tweets appear in the
two subtasks but with different labels. In addition, a script was provided to collect 10 million
unlabeled tweets for optional use.
For data collection, the Twitter API was used to crawl and collect data from 100 provinces
belonging to 21 Arab countries. The process took around 10 months to finish. The next step
was identifying users who tweeted exclusively in the same province during the whole 10
months, for precision purposes in the labeling process.
Figure 3.2: Distribution of classes for Training and Development sets [34].
We can distinguish some issues in the dataset. First, and most importantly, the data-labeling
method is not ideal: a person in a certain country can use dialects other than the one used in
that country during the 10 months of data collection, which is the case for recent immigrants,
for example. Second, the distribution of data is unbalanced, as can be seen in figure 3.2. This
imbalance has a major impact on training and therefore affects the results of the predictive
models. Third, communication on social media is not MSA-free: users can switch between
the two varieties, and since MSA is common to all Arabic users, it might create confusion in
the training process. Another encountered problem consists of having non-Arabic words in
the datasets. Some languages, such as Farsi, are written using Arabic characters; the API
cannot tell the difference and considers them Arabic.
• Joint parameter selection: you can grid search over the parameters of all estimators in the
pipeline at once (a minimal sketch follows).
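The sketch below shows such a pipeline with joint parameter selection; the toy tweets, labels and parameter grid are illustrative, not the exact ones used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Grid search over parameters of all estimators in the pipeline at once
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

texts = ["3aslema chbik labes", "sbeh el khir winek", "barcha barcha behi",
         "ezzayak 3amel eh", "ana mish 3aref", "khalas tamam ya basha"]
labels = ["TUN", "TUN", "TUN", "EGY", "EGY", "EGY"]

search = GridSearchCV(pipe, param_grid, cv=3).fit(texts, labels)
print(search.best_params_)
```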
For the BERT models, after tokenizing the data with the BERT tokenizer and rearranging the
training and development datasets, they are passed into a PyTorch DataLoader. We pass two
variables to BERT's forward function: input_ids and attention_mask. The input_ids are
simply the numeric representations of the tokens. The attention_mask is useful when adding
padding to the input tokens: it indicates which input_ids correspond to padding. Padding is
added to make sure all input sentences have the same length, so that tensor objects can be
formed properly.
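A minimal sketch of this step; the checkpoint name matches one of the models used later, while the toy tweets, maximum length and batch size are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")

tweets = ["3aslema winek ya sahbi", "chbik mela labes"]   # toy examples
labels = torch.tensor([0, 0])

# Padding makes every sentence the same length; the attention mask marks
# which positions are real tokens (1) and which are padding (0)
enc = tokenizer(tweets, padding="max_length", truncation=True,
                max_length=64, return_tensors="pt")

dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], labels)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

for input_ids, attention_mask, y in loader:
    print(input_ids.shape, attention_mask.shape)   # torch.Size([2, 64]) each
    break
```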
3. Implementation
3.1. Machine Learning Models
• k-nearest neighbors
We fed the data to the classifier and the results were poor, with an accuracy of 0.241 and an
F1-score of 0.0709. The confusion matrix in figure 3.3 shows that this algorithm is a bad
choice for this task: basically, all the dialects were classified as Egyptian, since Egyptian data
dominates the dataset.
Since the SVM classifier did well, we used a linear kernel to see if it could help improve the
results. A linear kernel is used when the data is linearly separable, that is, when it can be
separated using a single line. It is one of the most common kernels, and it is mostly used
when there are many features in a dataset, as in our case (a short sketch follows).
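Switching kernels is a one-line change in scikit-learn; in this sketch, the toy 2-D points stand in for the high-dimensional TF-IDF features.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data standing in for TF-IDF features
X = np.array([[0.0, 1.0], [0.2, 0.9], [1.0, 0.1], [0.9, 0.0]])
y = np.array([0, 0, 1, 1])

svm_linear = SVC(kernel="linear").fit(X, y)   # default SVC uses an RBF kernel
print(svm_linear.predict([[0.1, 0.8], [0.8, 0.2]]))   # -> [0 1]
```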
Clearly there was a minor improvement, as the F1-score became 0.1869 and the accuracy
score reached 0.3438. However, there is still confusion when it comes to identifying the
dialects of the Gulf countries, as illustrated in figure 3.7.
3.2. Transformers
Our training process was divided into two phases: first, tuning the language model using the
10 million tweets, and then tuning on 19,950 tweets for the classification task. Due to the
limited resources provided by Google Colaboratory for free users, we were not able to use
the whole 10 million unlabeled tweets: whenever we exceeded 100,000 tweets, the system
immediately crashed and the environment restarted, deleting all the previous work. To
overcome this issue, we stuck with only 100,000 unlabeled tweets and added both the training
and testing data to train the language model. The pretraining process of each model took
around 8 hours to finish for 10 epochs. When analyzing the loss values, we noticed that after
3 epochs they stay stable for a while and sometimes start growing instead of dropping,
meaning that the model starts overfitting. Therefore, the optimal number of epochs is 3,
which made the pretraining time shorter (around 4 to 5 hours). After completing the
pretraining task, we added a linear layer to the embedding model to take care of the
classification task. Again, we tested training for 8 epochs, which took around 4 hours; the
loss observations made it clear that 4 epochs are enough to avoid overfitting.
In this project, we could not use the large models, since training them on a GPU takes
forever and Google Colaboratory gives limited GPU time (around 10 hours).
We worked with 3 different models from the HuggingFace transformers library:
"aubmindlab/bert-base-arabert", "aubmindlab/bert-base-arabertv2" and "asafaya/bert-
base-arabic". There were some issues working with these models, since the documentation
we found was outdated and a few changes had been made to the library; the debugging part
took most of the time. Eventually, everything worked out and the results were satisfying. The
overall two-phase procedure is sketched below.
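A condensed sketch of the two-phase procedure with the HuggingFace Trainer API; the file name, output directories and hyperparameters are placeholders, and the actual runs used the settings described above.

```python
from transformers import (AutoModelForMaskedLM, AutoModelForSequenceClassification,
                          AutoTokenizer, DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

model_name = "asafaya/bert-base-arabic"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Phase 1: further pretrain the language model on unlabeled tweets (MLM)
lm_model = AutoModelForMaskedLM.from_pretrained(model_name)
lm_dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                   file_path="unlabeled_tweets.txt",  # placeholder
                                   block_size=64)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)
Trainer(model=lm_model,
        args=TrainingArguments(output_dir="lm_out", num_train_epochs=3,
                               per_device_train_batch_size=32),
        data_collator=collator,
        train_dataset=lm_dataset).train()
lm_model.save_pretrained("tweet_lm")
tokenizer.save_pretrained("tweet_lm")

# Phase 2: add a classification head on top of the tuned language model;
# it is then trained with a second Trainer on the labeled NADI tweets
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "tweet_lm", num_labels=21)   # 21 country-level labels
```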
Table 3.2: Transformers performance summary
Model | Accuracy | Macro recall | Macro precision | Macro F1
aubmindlab/bert-base-arabert | 0.3082 | 0.1776 | 0.1799 | 0.2012
4. Comparative analysis
In this project, we built several models to find the best one for the dialect identification task.
The comparison in table 3.3 shows the top 6 approaches that helped us find the best solution.
Non-deep-learning models had almost similar performances. The BERT model without fine-
tuning of the language model is still very powerful just by adding a classification layer.
However, fine-tuning the language model is necessary for the performance to exceed 0.20 on
the F1-score; this is done using the extra unlabeled tweets, since they are closer to the target
data than the Wikipedia articles on which the original models were trained.
Table 3.3: F1 Score of different models on development set and their ranking on the task
Ranking Model F1-score
1 asafaya/bert-base-arabic 0.2231
2 aubmindlab/bert-base-arabertv2 0.2107
3 aubmindlab/bert-base-arabert 0.2012
4 SVM with linear kernel 0.1869
5 SVM 0.1696
6 Bernoulli Naive Bayes 0.1401
Looking at the confusion matrix for BERT with fine-tuning of the language model in figure 2,
we can notice the consequences of class imbalance and feature overlap. The most obvious
case is the Gulf countries mentioned before: the model predicted most of them as Saudi
Arabia and did not predict any Bahraini tweets correctly.
5. Perspective
Even though the benchmarks are satisfying and meet the goal of the internship, there is
definitely still room for more improvements. The data was unbalanced in the distribution of
tweets, which created a lot of problems in the classification task, especially with machine
learning algorithms. In addition, data preprocessing was not ideal; we feel the data could be
cleaned better. Also, the nature of the Arabic language makes it very hard to deal with; to this
day, the Arabic NLP community still lacks the tools to tame the language. There is still a lot
of research to be done that can help in this task.
Another way to improve the results relies on using a better-trained language model. In our
project, the language model was only tuned on 100,000 unlabeled tweets, which is just 1% of
the data we had. This 1% made a small difference compared with models without a tuned
language model. Using the rest of the unlabeled data could actually improve our results, but
to do so, the training must be done on cloud GPUs.
In addition, we only used the base models of BERT. The large versions have more parameters
and are trained on larger datasets. This means that a large model can perform a lot better than
a base model. However, those kinds of models are very slow to train on a GPU: we would
need either more GPUs working in parallel or training on TPUs.
Conclusion
In this chapter, we presented the overall work done in this project. We discussed the results
achieved with both machine learning and deep learning algorithms. The results were decent,
as expected given the limited resources we had. However, our models could achieve better
results if they were trained on a cloud GPU.
General Conclusion
During this internship, I had the opportunity to dive into the field of natural language
processing. Having prior experience in machine learning and NLP with deep learning, this
project was a perfect opportunity to enlarge my knowledge, face new challenges, and get a
wider view of the state of the art in this field.
The objective of this work was to design a state-of-the-art classification model for Arabic
dialect detection. We were able to implement multiple models for dialect identification using
Scikit-learn, PyTorch and the HuggingFace library. This was my first experience with BERT.
The results were good. They were achieved in three stages: first, we further pretrained a
publicly released BERT model (i.e., Arabic-BERT) on 100,000 unlabeled tweets. Second, we
trained the resulting model on the NADI labeled data for country-level identification multiple
times, independently. Third, we selected the best performing model (based on performance
on the development dataset) and compared it to the traditional machine learning results.
There is still room for future improvements and better accuracy. We can change our
preprocessing procedure and use the whole unlabeled dataset. We can also use cloud GPUs
for faster training and try the large version of BERT.
There is no doubt that this project is important for our community, which will benefit a lot
from it. It can encourage future research, such as studying the interaction between MSA and
dialectal Arabic (DA) in novel ways. For example, questions as to the utility of using DA data
to improve MSA regional-use classification systems, and vice versa, can be investigated
using various machine learning methods.
References
[1] Documents provided by the host company.
[2] Wang Hua, Ma Cuiqin and Zhou Lijuan - A Brief Review of Machine Learning and its
Application - 19 June 2021.
[3] Isha Salian - SuperVize Me: What's the Difference Between Supervised, Unsupervised,
Semi-Supervised and Reinforcement Learning? -
https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/ - 2 August 2018.
[4] Mar Mbengue - Machine Learning pour débutant : Introduction au Machine Learning -
https://penseeartificielle.fr/introduction-au-machine-learning/ - 21 April 2020.
[5] New York Times - Computer Wins on 'Jeopardy!': Trivial, It's Not -
https://www.nytimes.com/2011/02/17/science/17jeopardy-watson.html - 17 February 2011.
[6] Charles Manski - "Regression", Journal of Economic Literature - 1 March 1991.
[7] Sidath Asiri - Machine Learning Classifiers -
https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623 - 11 June 2018.
[8] Rajvi Shah - Introduction to k-Nearest Neighbors (kNN) Algorithm -
https://ai.plainenglish.io/introduction-to-k-nearest-neighbors-knn-algorithm-e8617a448fa8 -
3 March 2021.
[9] Quinlan, J. R. - "Simplifying decision trees", International Journal of Man-Machine
Studies - 1 September 1987.
[10] Ho, Tin Kam - Random Decision Forests - 6 August 2002.
[11] John, George H.; Langley, Pat - Estimating Continuous Distributions in Bayesian
Classifiers - 20 February 2013.
[12] B. Mehlig - Multilayer perceptron - 10 February 2021.
[13] Avinash Sharma V - Activation Functions - 18 July 2016.
[14] R. Parmar - Loss Functions -
https://towardsdatascience.com/common-loss-functions-in-machine-learning-46af0ffc4d23 -
2 September 2018.
[15] V. S. Bawa - The architecture of RNN -
https://pydeeplearning.weebly.com/blog/basic-architecture-of-rnn-and-lstm - 18 January 2017.
[16] Savvas Varsamopoulos - Designing neural network-based decoders for surface codes -
November 2018.
Appendix