
Acknowledgement

I would like to express my deepest gratitude to each and every one who supported me and
kept me going. I dedicate this work to those who have advised me, criticized me, believed in
me and supervised me. I give my thanks through these short lines to:

• My dear late father, I dedicate this to you. I know how much you have been waiting to
witness the day I graduate. I know you would be so proud of me. I miss you with
every breath I take.

• My dear mother, your kindness and your endless love made me what I am today. May
God protect you.

• My supervisor Mr. Hatem HADDAD, Co-founder and CTO of the Company. The one
who welcomed me with open arms, guided me through this project and helped me a
lot with his patience and constructive recommendations.

• My academic supervisor Mr. Marwen KERMANI, I would like to thank my teacher
with all my heart for his praiseworthy efforts. You gave me your undivided attention,
guided me and kept encouraging me to keep moving forward.

• All my family and friends who supported me through my darkest times. You were, and
still are, my safe haven and my shelter.

Finally, I extend my deepest gratitude to all my teachers for the training they have provided
me and all those who have contributed directly or indirectly to the smooth running of this
project.

Table of contents
Acknowledgement ................................................................................................................................. 1
Table of contents.................................................................................................................................... 2
List of figures ......................................................................................................................................... 4
List of tables ........................................................................................................................................... 5
List of abbreviations.............................................................................................................................. 6
General Introduction ............................................................................................................................ 8
Chapter 1: Project concept and objectives .......................................................................................... 9
Introduction ....................................................................................................................................... 9
1. Host company presentation ...................................................................................................... 9
1.1. Company overview ............................................................................................................ 9
1.2. Research and Development ............................................................................................ 10
1.3. Services ............................................................................................................................. 10
1.4. Market Structure ............................................................................................................. 10
2. Problem statement ................................................................................................................... 11
2.1. Problem description ........................................................................................................ 11
2.2. Objectives ......................................................................................................................... 11
2.3. Need analysis .................................................................................................................... 12
2.4. The importance of an AI solution .................................................................................. 13
2.5. Work methodology .......................................................................................................... 17
Conclusion ........................................................................................................................................ 17
Chapter 2: Theoretical Background ................................................................................................... 18
Introduction ..................................................................................................................................... 18
1. Machine Learning Algorithms ............................................................................................... 18
1.1. Machine learning presentation ....................................................................................... 18
1.2. Machine Learning methods ............................................................................................ 18
1.3. Application of machine learning algorithms ................................................................. 21
1.4. Some machine learning classification algorithms ......................................................... 22
2. Deep Learning.......................................................................................................................... 24
2.1. The perceptron ................................................................................................................ 25
2.2. Multilayer perceptron (MLP) ........................................................................................ 25
2.3. Activation functions......................................................................................................... 26
2.4. Loss functions .................................................................................................................. 27
2.5. Neural network training process .................................................................................... 28

3. Recurrent neural networks ..................................................................................................... 28
3.1. Long Short-Term Memories ........................................................................................... 29
3.2. Gated Recurrent Units .................................................................................................... 30
4. Transfer Learning and Transformers ...................................................................................... 31
4.1. Transfer learning in NLP ............................................................................................... 31
4.2. Attention mechanisms ..................................................................................................... 32
4.3. The Transformers architecture ...................................................................................... 33
5. BERT ........................................................................................................................................ 34
5.1. Pretraining BERT ........................................................................................................... 35
5.2. Fine-Tuning BERT .......................................................................................................... 37
5.3. BERT For Arabic Language .......................................................................................... 37
Conclusion ........................................................................................................................................ 38
Chapter 3: Implementation and Experimental Results ................................................................... 39
Introduction ..................................................................................................................................... 39
1. Work Environment ................................................................................................................. 39
1.1. Hardware Environment .................................................................................................. 39
1.2. Software Environment .................................................................................................... 40
2. Dataset ...................................................................................................................................... 41
2.1. NADI Shared task Dataset .............................................................................................. 41
2.2. Dataset Analysis ............................................................................................................... 42
2.3. Preprocessing the data .................................................................................................... 43
3. Implementation ........................................................................................................................ 44
3.1. Machine Learning Models .............................................................................................. 44
3.2. Transformers ................................................................................................................... 48
4. Comparative analysis .............................................................................................................. 49
5. Perspective ............................................................................................................................... 51
Conclusion ........................................................................................................................................ 51
General Conclusion ............................................................................................................................. 52
Appendix .............................................................................................................................................. 55

List of figures

Figure 1.1: the main axes of the company's business .............................................................................. 9


Figure 1.2: Jasmin Robot[1] .................................................................................................................. 11
Figure 1.3: Functional Need Diagram ................................................................................................... 12
Figure 1.4: Age distribution of the participants..................................................................................... 14
Figure 1.5: Occupations distribution of the participants ....................................................................... 14
Figure 1.6: Fields of study/work ........................................................................................................... 15
Figure 1.7: The importance of AI in communication ............................................................................ 16
Figure 1.8: The benefits of AI ............................................................................................................... 16
Figure 2.1: Supervised Learning strategy [4] ........................................................................................ 16
Figure 2.2: Unsupervised learning example: feature extraction [3] ...................................................... 19
Figure 2.3: Semi-supervised learning algorithm: CT scan example [3] ................................................ 20
Figure 2.4: IBM WATSON wins jeopardy![5] ..................................................................................... 20
Figure 2.5: Classification example with KNN [8]................................................................................. 23
Figure 2.6: The architecture of the perceptron[12]................................................................................ 25
Figure 2.7: Artificial Neural Network[12] ............................................................................................ 26
Figure 2.8: Internal architecture of an LSTM unit [16] ......................................................................... 29
Figure 2.9: Architecture of a GRU cell[17]........................................................................................... 29
Figure 2.10:Transfer Learning Diagram................................................................................................ 31
Figure 2.11: the three steps of transfer learning in NLP[18] ................................................................. 32
Figure 2.12: global architecture of the transformer[20] ........................................................................ 34
Figure 2.13: an example of MLM training[22] ..................................................................................... 35
Figure 2.14: the changes added to the encoder for MLM[22] ............................................................... 36
Figure 2.15: BERT input representation[23] ......................................................................................... 36
Figure 3.1: Theoretical Peak Floating Point Operations per Clock Cycle [26] ...................................... 36
Figure 3.2: Distribution of classes for both Train and Development sets[34]....................................... 39
Figure 3.3: kNN classifier's confusion matrix ....................................................................................... 44
Figure 3.4: Random Forest classifier's confusion matrix ...................................................................... 45
Figure 3.5: Bernoulli NB classifier's confusion matrix ......................................................................... 46
Figure 3.6: SVM classifier's confusion matrix ...................................................................................... 47
Figure 3.7: SVM classifier with linear kernel's confusion matrix ......................................................... 48
Figure 3.8: BERT's confusion matrix .................................................................................................... 50

List of tables

Table 2.1: different versions of araBERT ............................................................................................. 38


Table 2.2: different versions of Ali Safaya's BERT for Arabic language ............................................. 38
Table 3.1: Distribution of country-level dialect identification data across data splits .......................... 42
Table 3.2: Transformers Performance summary ................................................................................... 49
Table 3.3: F1 Score of different models on development set and their ranking on the task ................. 50

List of abbreviations

AI: Artificial Intelligence


araBERT: Arabic Bidirectional Encoder Representation from Transformer
B2B: Business to Business
BERT: Bidirectional Encoder Representation from Transformer
CIPIT: Center of Intellectual Property and Information Technology
CLS: Classification
CNN: Convolutional Neural Network
CPU: Central Processing Unit
CT scan: Computed Tomography Scan
DA: Dialectal Arabic
DevOps: Development and Operations
DL: Deep Learning
FFN: Feed Forward Network
GPU: Graphics Processing Unit
GRU: Gated Recurrent Unit
IBM: International Business Machines
kNN: k Nearest Neighbors
LSTM: Long Short-Term Memory
MEA: Middle East and Africa
MENA: Middle East and North Africa
ML: Machine Learning
MLM: Masked Language Modeling
MLP: Multi-Layer Perceptron
MSA: Modern Standard Arabic
NADI: Nuanced Arabic Dialect Identification
NLP: Natural Language Processing
NN: Neural Network
NSP: Next Sentence Prediction
R&D: Research and Development
ReLU: Rectified Linear Unit
RNN: Recurrent Neural Network

SEP: Separation
SMS: Short Message Service
SPSS: Statistical Package for the Social Sciences
Tanh: Hyperbolic Tangent
TF-IDF: Term Frequency- Inverse Document Frequency
TPU: Tensor Processing Unit
UN: United Nations
UNESCO: United Nations Educational Scientific and Cultural Organization


General Introduction

The objective of natural language processing has always been to build machines able to simulate
the human ability to process and understand language, with the final aim of creating automated
solutions that can understand and interact with humans with a high level of accuracy. The goal is
to make computers able to identify, understand and generate language without any assistance. The
research done so far has achieved outstanding results for Latin-based languages, especially
English. However, other languages, such as Arabic, have been neglected.

The term Arabic language can be thought of as an umbrella term, under which it is possible to
identify hundreds of varieties of the language. Despite this diversity, those varieties were
strictly restricted to speech, leaving Modern Standard Arabic (MSA) dominating the written
forms of communication. However, with the huge advancement of social media, an explosion
of written content in these neglected varieties took the internet by storm, which attracted the
attention and interest of the NLP research community in the process.

One of those dialects is our beloved Tunisian dialect. It is not only used by Tunisians for their
everyday communication, but it is sometimes also a literary language, used for proverbs, rhymes,
tales, riddles and poems, and a language for writing songs and plays. Today it is widely used on
radio, television and in advertising. It has become a necessity to find a way to automatically
deal with the Tunisian dialect in particular and all Arabic dialects in general.

The report is divided into three chapters. In chapter 1, we present the host company and overview
the need analysis and the importance of an AI solution. This is followed by chapter 2, in which
we dive into the theory behind machine learning and deep learning. Finally, in chapter 3, we
go through the data and the work procedure and present the results, with a discussion of the
most significant models we tested in our experiments.


Chapter 1: Project concept and objectives

Introduction
In this chapter, we are going to present a general overview of the company that hosted this
project. We will highlight the company’s field of business, its services, and its position in the
market. We will also go through a need analysis for the project and its importance for the
community.

1. Host company presentation


1.1. Company overview
ICompass was founded in July 2019 [1]. It is a B2B Artificial intelligence startup, working in
the field of Linguistics and Deep Learning. The figure below shows the core technology of
the company.

Figure 1.1 : The main axes of the company's business [1].


As shown in figure 1.1, the startup provides AI services in the fields of Natural Language
Processing, Machine Translation, and Speech Recognition and Synthesis, which can help other
companies through digital transformation and by upgrading their customer services.
In less than two years, the company has made massive achievements. It obtained the "startup
act” label from the Ministry of Communication Technology and Digital Transformation, was named
“The most innovative private Text Speech A.I technology company in North Africa” by Zindi
Africa and became a project member of the "UNESCO World Atlas of Languages". The
company has also achieved strong results in the Safe Tunisia competition ("Novation City" and
the “Open Innovation”) and the Covid-19 Maghreb Bootcamp (25-27 June 2020).

1.2. Research and Development


Despite its religious, political, and cultural significance, the Arabic language has received
little attention in the modern NLP community. As a contributor to the linguistics field, iCompass
aims to help break this barrier and address this neglect. To date, the team has published
six papers, one of which presents the "TUNIZI" dataset [1].
In November 2019, ZINDI Africa, in association with UNESCO, CIPIT and Knowledge 4
All, launched a competition at Strathmore University that aims at encouraging data
scientists to gather and create datasets for African countries. iCompass came up with
the “TUNIZI” dataset, which was selected among the top five outstanding submissions. The
company was then asked to create another dataset to be included in the UNESCO World Atlas of
Languages platform.

1.3. Services
The team is composed of AI engineers, R&D researchers, and linguistic experts. They use
the latest technologies to provide state-of-the-art services such as digital reputation analysis,
sentiment analysis, chatbots and other consulting services. One of the most successful
products is “3ziza” [1], an AI-based chatbot. During the covid-19 pandemic, it was vital for
the Ministry of Health to deliver recent updates about the current situation. There was a huge
number of calls on a daily basis and a lot of repeated questions. To reduce the number of calls
and help spread the information, this chatbot was created and deployed on the official website
covid-19.tn provided by the Ministry of Health.
The chatbot was taught to speak and understand French, Arabic and the Tunisian dialect. The
training was done on the “TUNIZI” dataset, which is written in Latin script mixed with
numbers, just like how Tunisians interact on social media. This makes the chatbot
more user friendly. It was able to respond to more than 10,000 questions a day, with a
retention of 7.4 questions per user. “3ziza” has without a doubt gained the trust of the Tunisian
citizens, who kept asking and interacting with it to get answers.

1.4. Market Structure


iCompass aims at contributing to the digital transformation of the Middle East and Africa
(MEA) region. By liberalizing digital communication and building a deeper computational
understanding of modern languages and dialects, the company is committed to creating new R&D
smart solutions for the economy.


The company has supporters and partners everywhere. In Tunisia, we can name the Tunisian
Presidency of the Government, the Ministry of Interior, the Ministry of Health, and other
startups like “Enova Robotics”. There are also foreign partners, such as "Starfolk
Software Solutions" in Nigeria.
These partnerships gave birth to mutual products and services. For example, the collaboration
with Enova Robotics helped create “Jasmin”, represented in figure 1.2. Jasmin
is a robot deployed during the covid-19 pandemic at Sahloul hospital in Sousse. Its role
consists in reducing as much as possible every direct contact between the medical staff
and the patients.

Figure 1.2: Jasmin Robot [1].

2. Problem statement
2.1. Problem description
In the field of automatic processing of the Arabic language, most of the research and
achievements have focused only on Modern Standard Arabic, without giving dialectal
Arabic the attention it needs. It is only in the last 10 years that these dialects have begun
to arouse an increasing interest within the NLP community, especially given their increased
use on social media and the social web.

2.2. Objectives
In this work, we focus on the Tunisian dialect, and propose to provide a state of the art on the
automatic processing of this dialect by presenting a model able to detect it whenever it is
used. This model will be based on the latest AI technologies and can be implemented on the
web.


2.3. Need analysis


The Tunisian dialect is mainly a spoken language, used by Tunisians in their everyday life. But it
also exists in their literature, proverbs, rhymes, riddles, tales, and poems. Today it is used on
radio, television, cinema and in advertising.
During these last years, the massive adoption of new modes of communication (SMS, e-mails,
Facebook, Twitter, etc.) has encouraged the writing of messages and various textual content in
dialect, especially in the Latin script. So, it became a necessity to find an approach to deal
with it, for numerous reasons and multiple purposes. Figure 1.3 shows the functional need
diagram.

Figure 1.3: Functional Need Diagram.


Non-functional requirements are quality specifications of the software system. Some may be
considered critical to the success of the system. They must guarantee the following
operational needs:
• Continuous availability: the system is always available for use whenever needed.
• Performance: the system has to run smoothly.
• Integrity: In case of error or bug the user must be alerted.
• Sustainability: the model can be retrained and updated.


2.4. The importance of an AI solution


Artificial intelligence is a powerful tool, growing every day and becoming an essential
part of all industries. Many papers have been published explaining its use and importance, proving
how reliable this tool can be for economic growth and development. However, it is only slightly
used in Tunisian industry, and we can say it is missing, even though it has been getting a lot of
attention lately among young developers.
To explain how important AI can be, we can rely on the two following studies:

2.4.1. Qualitative Study


The following points explain why AI is a suitable solution for many problems:
• Companies need a tool to identify the perception of the messages that they deliver on
social media. This is an important task and can be considered vital, as some messages can
be misinterpreted and drag the company to rock bottom. Communicating virtually with
people of different intellectual and cultural backgrounds is a delicate task and a
double-edged sword. A single misunderstanding can damage a brand’s name. For
example, in May 2020, thousands of people united across the globe in support of the
movement “Black Lives Matter” after the murder of George Floyd, an American victim of
police brutality. The Tunisian brand “Nana” joined the movement by posting a
controversial picture of a white woman with a face painted with black shades. The
intended message was to say that all people are equal no matter their skin color.
However, the Tunisian community took it as an act of racism. This led the brand to delete
the picture and make a public apology. Through AI, we can keep track of the
community’s perception and alert companies in cases like this.
• Hashtags are massively used on social media when creating content. Videos, pictures,
articles and even comments might contain hashtags. It has become a habit to add a
hashtag to create a visual identity, to target larger groups of people, to stand out from
competitors and to build values. It is a tool used by many influencers around the globe,
and AI is a perfect tool to keep track of this enormous wave of hashtags. We can take
the Tunisian Ministry of Tourism as an example: it launched visual content in the
past few months and declared that a new slogan, to be used on social media, will follow in
the near future. Through AI, the ministry can track the slogan and the
community’s reviews, in what is called today “nation branding”.
• The classical way for a company to interact with consumers and potential customers was
through polling companies, which is very money- and time-consuming. To elaborate,
companies must launch campaigns to introduce their products or services and then pay
polling companies to gather the masses' opinions, analyze them and report how well they were
received. Now, a simple AI program can do this task in a short period, saving both time
and money.
• It is critical to have a precise interpretation of people’s opinions. Polling companies
cannot ensure that, as they investigate opinions through close-ended questions. The
results can be biased, since the answer must be one of the choices included in the poll. The
respondent must be free to express their opinion the way they want. This yields more
information and helps more in understanding them. On social media, people comment
with no limitation, and the company receives full feedback and communicates better with
the audience through an AI program that automates these tasks.
This qualitative study shows that AI is a cost-effective and time-saving tool to measure the
perception of the audience, interact with it and have a better understanding of the
community.

2.4.2. Quantitative study


The following quantitative study was done by iCompass. The team designed an online
questionnaire for data collection. 63 people volunteered to fill in the form given in the appendix.
Their opinions were later analyzed with the SPSS 25.0 software.
The study targeted people of different ages, as shown in figure 1.4:

Figure 1.4: Age distribution of the participants.


As we can see in figure 1.4, 73% of the people who participated in the questionnaire are
aged between 18 and 25 years old. 15.87% (10 people) are aged between 26 and 35 years old,
9.52% between 36 and 55 years old and 1.52% are aged more than 55 years old.
The chart given in the figure 1.5 shows the occupation of the participants.


Figure 1.5: Occupations distribution of the participants.


Most of the participants are students (57.14%). In second place, 31.75% are employed,
and finally 11.11% work as freelancers or are self-employed.
When asked about their field of study/work, they gave the results shown in figure 1.6:

Figure 1.6: Fields of study/work.


Figure 1.6 shows that the dominant field is business, with around 27.42% of the
respondents. In second place we find computer science, represented by 17.74%. IT
engineering and mechanical engineering share third place with 11.29%. In fourth place come the
humanities with 9.68%, then the biomedical and medical field in fifth with 6.45%. Next, education
with 3.23% and communication with 1.61%. The rest were other fields not specified in the
questionnaire.
• From the results above, we can notice how the interest in AI is growing in the Tunisian
community, especially among the young generations. Many fields share this interest, to the point
that we can no longer deny that AI has become a part of our lives.


All the participants in the questionnaire were asked “How important is it to use AI in the Tunisian
administration’s communication?” and “Why do you think AI is important?”. Their answers
are illustrated in figures 1.7 and 1.8.

Figure 1.7: The importance of AI in communication.

Figure 1.8: The benefits of using AI.


Figure 1.7 illustrates that 77% of the participants think that using AI in communication is
important. Around 55.56% of them deemed it very important. Their reasons, represented in
figure 1.8, are the following: 70% want to save money, 21% want to avoid crowdedness
and the rest think it will save time.


• The majority think that AI is important to facilitate the communication between the
Tunisian administrations and the citizens, to make accessing information easier and to
save both money and time.

2.5. Work methodology


To achieve the defined objectives, a working plan must be made to ensure the quality and
precision of the final product. We created a schedule for the progress of the project as follows:

• Bibliographic study to sort out the importance of this work, the problems encountered
by other researchers and the results of their efforts in the field.
• Gather data related to the topic and find an appropriate algorithm to clean it.
• Mimic the previous work mentioned in the bibliographic study and analyze the results.
• Identify the problems and find a way to deal with them.
• Adopt the appropriate strategies to create a state-of-the-art model.

Conclusion
In this chapter, we have presented the host company, its services and its market structure. We
also overviewed the importance of this project for our community and explored why AI is
needed. The next chapter is an in-depth theoretical explanation of the machine learning and
NLP fundamentals on which the project was built.


Chapter 2: Theoretical Background

Introduction
In this chapter we are going to explain the theory behind machine learning and deep learning
and go through some of the math behind them. Then, we will dive into transfer learning and
Transformers, which are known for their state-of-the-art results.

1. Machine Learning Algorithms


1.1. Machine learning presentation
Machine learning [2] is a branch of both computer science and AI that uses different algorithms
and manipulates large amounts of data to mimic the way human beings learn, develop, and gradually
improve. It is becoming a major part of the field of data science, since it relies on statistical
methods to train algorithms to make predictions or perform classification tasks. It helps
uncover important key insights within data mining projects. Those hidden insights are an
important part of decision-making, impacting the growth of the business and its development.

1.2. Machine Learning methods


Machine learning methods are defined by whether or not there is human influence on the raw
data: whether labels are provided, whether there is a reward system, or whether specific feedback
is given. According to Nvidia, maker of the world's first portfolio of purpose-built AI
supercomputers, there are many machine learning methods [3], of which we are going to present four.

1.2.1. Supervised learning


Each example of the dataset comes with a label given by the user, serving as an answer.
Just as humans learn in the presence of someone judging whether they are right or
wrong, the labels serve the same purpose: the model compares them to its training results to
improve its performance. Figure 2.1 gives an idea of the strategy behind
supervised learning.

Figure 2.1: Supervised Learning strategy [4].


As shown in figure 2.1, in supervised learning the user must label each and every
example of the dataset. Then, the data is divided into two samples. The first sample is the
biggest and is passed through a machine learning algorithm to adjust its parameters and
optimize it for the best performance. The second serves as a test to check the accuracy of the
model in untrained situations.
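To make this concrete, here is a minimal sketch of the supervised workflow (labeled data, a larger training sample, a smaller test sample) using scikit-learn; the library, the toy digits dataset and the logistic regression classifier are illustrative assumptions, not choices made in this report.

```python
# Minimal supervised-learning sketch: labels -> train/test split -> fit -> evaluate.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)             # features and user-provided labels

# The first (larger) sample adjusts the model's parameters, the second tests it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                     # optimize parameters on the training sample

y_pred = model.predict(X_test)                  # accuracy in "untrained" situations
print("Test accuracy:", accuracy_score(y_test, y_pred))
```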

1.2.2. Unsupervised Learning


In this method, we feed the algorithm with unlabeled raw data and without any explicit
instructions. The algorithm then analyses and clusters the data, identifying different hidden
patterns and features without human intervention.

Figure 2.2: Unsupervised learning example: feature extraction [3].


Figure 2.2 shows an example of unsupervised learning widely used in the computer vision
field. Images are passed through the algorithm, which extracts many features that
can be used as labels for another program.
Depending on the problem in hand, we can distinguish a few types of unsupervised learning
algorithms such as:

• Clustering: usually used to group similar data together. For example,
classifying bird species through pictures (a small sketch follows this list).
• Anomaly detection: unsupervised learning can be used to flag outliers in a dataset. For
example, detecting fraudulent transactions in the banking business.
• Association: by looking at a few key attributes in the data, unsupervised learning can
predict other attributes that go along with them. For example: recommendation systems.
• Autoencoders: autoencoders take input data, compress it into a code, then try to recreate
the input data from that summarized code. For example, by using both noisy and clean
versions of images in training, unsupervised learning can remove the noise from other
pictures.
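As a small illustration of the clustering case above, the following sketch groups unlabeled points with k-means; scikit-learn and the synthetic blob data are assumptions made only for this example.

```python
# Minimal unsupervised-learning sketch: cluster unlabeled data with k-means.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # true labels are ignored

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)             # hidden groups found without any labels
print(cluster_ids[:10])
```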


1.2.3. Semi-supervised Learning


This method is used when extracting relevant features is difficult and it takes a very long time
to label each example. It uses a smaller labeled dataset to guide the classification and the
feature extraction from a larger unlabeled dataset. In this way, we can overcome the problem of
not having enough training data to train a supervised learning algorithm.
An example where semi-supervised learning is commonly used is CT scans, shown in
figure 2.3. A trained radiologist can go through a small dataset and label it for tumors or
diseases. But he cannot go through all the scans, because that is a time-intensive task. Semi-
supervised learning algorithms use these few labeled images as a guide to detect anomalies
in other scans.

Figure 2.3: Semi-supervised learning algorithm: CT scan example [3]
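One possible way to reproduce this "few labels guide the rest" idea in code is self-training; the sketch below uses scikit-learn's SelfTrainingClassifier on a toy dataset where most labels are hidden, which is an assumption made for illustration rather than the method used in this project.

```python
# Minimal semi-supervised sketch: a few labeled examples guide the unlabeled ones.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
y_partial = y.copy()
rng = np.random.default_rng(0)
unlabeled = rng.random(len(y)) < 0.9            # pretend 90% of the labels are missing
y_partial[unlabeled] = -1                       # -1 marks an unlabeled example

base = SVC(probability=True, gamma="scale")     # base classifier that outputs probabilities
model = SelfTrainingClassifier(base).fit(X, y_partial)
print("Accuracy on the hidden labels:", model.score(X[unlabeled], y[unlabeled]))
```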

1.2.4. Reinforcement Learning


Reinforcement learning is similar to supervised learning. The difference is that the
algorithm is not trained using sample data. Instead, it uses a “reward-punishment” system.
This is what offers the algorithm its learning material. It works like a video game: if you
complete a level, you earn a badge; if you defeat the boss, you earn bonus experience; and if
you fall into a trap, it is game over and you start back from the last checkpoint.
In this method of machine learning, AI agents attempt to find the optimal way to achieve a
goal or optimize the performance of a certain task. Actions that go toward the goal receive a
reward. The method aims to predict the best way to reach the goal.
As an iterative process, the more iterations there are, the more feedback is gathered from past
experiences. Therefore, the strategy gets better, and the results get closer to the optimal one. This is useful


in training robots with the ability to make a series of decisions, for example autonomous
vehicles and managing inventories in warehouses.
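The reward-punishment idea can be written as a simple value-update rule; the sketch below shows the classic tabular Q-learning update, where the tiny state space, the learning rate and the discount factor are illustrative assumptions.

```python
# Minimal tabular Q-learning sketch: actions that move toward the goal gain value.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))             # value of each (state, action) pair
alpha, gamma = 0.1, 0.9                         # learning rate and discount factor

def update(state, action, reward, next_state):
    best_next = Q[next_state].max()             # best value reachable from the next state
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])

# One fictitious experience: taking action 1 in state 2 earned a reward of +1.
update(state=2, action=1, reward=1.0, next_state=3)
print(Q[2])
```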

We cannot go through reinforcement learning without mentioning one of the history-making
moments, shown in figure 2.4, when the IBM Watson® system won the Jeopardy! challenge in
2011 [5] after beating the two former champions Ken Jennings, on the left, and Brad Rutter, on
the right.

Figure 2.4: IBM WATSON wins Jeopardy! [5]

1.3. Application of machine learning algorithms


Machine learning algorithms are becoming a major part of data science. They are used for
many purposes, among which the following fields:

1.3.1. Regression
This powerful statistical method is used in finance and investing to predict a value from one or
more independent variables [6]. The most used family is linear regression, of which we can
distinguish two types:

• Simple linear regression:
\( Y = a + bX + \varepsilon \)  (2.1)
• Multiple linear regression:
\( Y = a + b_1 X_1 + b_2 X_2 + \dots + b_n X_n + \varepsilon \)  (2.2)

where:
\( Y \): the variable that you are trying to predict (dependent variable),
\( X \) (or \( X_1, \dots, X_n \)): the variable(s) that you are using to predict Y (independent variables),
\( a \): the intercept,
\( b \) (or \( b_1, \dots, b_n \)): the slope(s),
\( \varepsilon \): the regression residual.
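Equation (2.2) corresponds directly to fitting a multiple linear regression; a minimal scikit-learn sketch, where the tiny data matrix is an assumption made for the example, recovers the intercept and the slopes.

```python
# Minimal linear regression sketch: estimate the intercept a and the slopes b_i.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3]])  # two independent variables
y = np.array([8, 7, 17, 16])                    # dependent variable to predict

reg = LinearRegression().fit(X, y)
print("intercept a:", reg.intercept_)
print("slopes b:", reg.coef_)
print("prediction for [5, 5]:", reg.predict([[5, 5]]))
```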

1.3.2. Classification
Classification is a predictive modeling problem in which a class label is predicted for a
given example of input data [7]. We can encounter four main types of classification
problems:
• Binary classification, for tasks having only two class labels.
• Multi-class classification, for tasks having more than two class labels.
• Multi-label classification, for tasks that have two or more class labels, where one or more
class labels may be predicted for each example; for example, a photo where the model can
predict the presence of multiple classes (like detecting the presence of “tree”, “bike” and
“person” in the picture).
• Imbalanced classification, for tasks where the number of examples in each class is unequally
distributed, like in fraud detection and medical diagnostic tests.

1.4. Some machine learning classification algorithms


In what follows, we will present some machine learning classification algorithms:
• k-Nearest Neighbors (kNN):
The k-nearest neighbors [8] algorithm stores all training instances as points in an n-
dimensional space. When it receives an unknown element, it calculates its distance
to the k nearest neighbors (with k an integer specified by the user) and returns the class to
which most of the k neighbors belong.
In distance-weighted nearest neighbors, the contribution of each of the k
neighbors is weighted according to its distance, using equation (2.3), which gives greater
weight to the nearest neighbors.

\( w_i = \dfrac{1}{d(x_q, x_i)^2} \)  (2.3)

with:
\( w_i \): the weight of the i-th neighbor,
\( x_q \): the unknown element,
\( x_i \): the i-th training point,
\( d(x_q, x_i) \): the distance between them.
Figure 2.5 shows how kNN works in a two-dimensional space.

Figure 2.5: Classification example with kNN [8].


As shown in the figure, a new point will be assigned to a certain class depending on its
nearest k neighbors. In the case of k=3, the point will be assigned to class B, since two
of the nearest 3 points belong to this class. However, if we choose k=7, 4 of the nearest points
belong to class A and 3 to class B, and therefore the new point is assigned to class A.
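A minimal sketch of both the uniform and the distance-weighted variants with scikit-learn follows; the iris dataset and the choice of k are assumptions made only for the example (note that scikit-learn's "distance" option weights each neighbor by the inverse of its distance).

```python
# Minimal kNN sketch: plain majority vote vs. distance-weighted vote.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for weights in ("uniform", "distance"):         # "distance" gives closer neighbors more weight
    knn = KNeighborsClassifier(n_neighbors=3, weights=weights).fit(X_train, y_train)
    print(weights, knn.score(X_test, y_test))
```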
• Decision trees
Decision trees [9] are decision support tools with a tree-like architecture. Each branch
represents a possible event. They use if-then rules which are exhaustive for classification.
These algorithms are used mostly in operations research, and specifically in decision analysis,
to help create a suitable work strategy, but they are also considered a popular tool in machine
learning.
• Random forests
Random forests [10] are learning methods used for both regression and classification
tasks. In classification, they operate by constructing multiple decision trees; the class
predicted by most trees is selected as the output. Random forests generally
outperform decision trees and are better at avoiding overfitting problems.

• Naïve Bayes classifiers


Naïve Bayes [11] is a probabilistic classification model based on Bayes' theorem, shown
in equation (2.4), with the assumption that all the attributes are conditionally
independent.


\( P(C_k \mid x) = \dfrac{P(x \mid C_k)\, P(C_k)}{P(x)} \)  (2.4)

with:
\( P(C_k \mid x) \): posterior probability, \( P(C_k) \): class prior probability, \( P(x \mid C_k) \): likelihood and \( P(x) \):
predictor prior probability.
The classification works by deriving the maximum posterior, i.e. the class \( C_k \) for which \( P(C_k \mid x) \) is maximal, with
the above assumption applied to Bayes' theorem. This reduces the computational cost, since we
only need to count the class distribution. Even though the assumption is not valid in most cases,
since the attributes are often dependent, Naive Bayes can still perform impressively.

There are many Naïve Bayes classifiers, depending on the assumption we make about the data. If we
assume that the data follows a normal (also called Gaussian) distribution, we are talking
about a Gaussian Naïve Bayes classifier (also noted Gaussian NB classifier). The
probability density of a value given a class can be computed with equation (2.5).

\( p(x = v \mid C_k) = \dfrac{1}{\sqrt{2\pi\sigma_k^2}} \, e^{-\frac{(v - \mu_k)^2}{2\sigma_k^2}} \)  (2.5)

with:
\( \mu_k \): mean of the values in x associated with class \( C_k \),
\( \sigma_k^2 \): the Bessel-corrected variance of the values in x associated with class \( C_k \),
\( p(x = v \mid C_k) \): the probability density of value \( v \) given class \( C_k \).
In the case where the features are independent Booleans describing the input, we can talk about a
multivariate Bernoulli event model. This model is popular for document classification. If \( x_i \) is
a Boolean expressing the occurrence or absence of the i-th term of the vocabulary, then the
likelihood of a document given a class \( C_k \) is given by equation (2.6):

\( p(x \mid C_k) = \prod_{i=1}^{n} p_{ki}^{x_i} (1 - p_{ki})^{(1 - x_i)} \)  (2.6)

with:
\( p(x \mid C_k) \): the likelihood of the document given class \( C_k \), and \( p_{ki} \): the probability that class \( C_k \) generates the i-th term.
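As a small illustration, the Gaussian and Bernoulli variants discussed above can be used directly through scikit-learn; the iris data, the toy documents and their labels are assumptions made only for this example.

```python
# Minimal Naive Bayes sketch: Gaussian NB on continuous features,
# Bernoulli NB on binary term-occurrence features (as in document classification).
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# Gaussian NB: features assumed to follow a normal distribution (equation 2.5).
X, y = load_iris(return_X_y=True)
print("GaussianNB accuracy:", GaussianNB().fit(X, y).score(X, y))

# Bernoulli NB: each feature marks the presence/absence of a vocabulary term (equation 2.6).
docs = ["free money now", "project meeting tomorrow", "win money free", "meeting notes"]
labels = [1, 0, 1, 0]                           # toy labels: 1 = spam-like, 0 = normal
X_bin = CountVectorizer(binary=True).fit_transform(docs)
print("BernoulliNB accuracy:", BernoulliNB().fit(X_bin, labels).score(X_bin, labels))
```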

2. Deep Learning
We can say that deep learning is a subset of machine learning. It uses artificial neural networks
that mimic the way the human brain processes information and makes decisions.


2.1. The perceptron


It is the simplest processing element in a neural network. It is basically a neuron that fires
when reaching a certain threshold, producing a single binary output from several normalized
inputs. Each input has a corresponding weight. The values of the inputs are multiplied by their
weights and summed together with a bias. The result passes through an activation function and is
compared to a threshold to determine the output. The perceptron model works as illustrated by
equation (2.7).

\( y = f\!\left( \sum_{i=1}^{n} w_i x_i + b \right) \)  (2.7)

with:
\( x_i \): the value of the i-th input,
\( w_i \): the weight of the i-th input,
\( b \): the bias.
The bias is an extra input and an external parameter of the neuron that shifts the value of the
activation function left or right. Its existence is sometimes critical to the success of the
learning process.
The figure 2.6 summarizes the architecture of the perceptron.

Figure 2.6: The perceptron architecture [12].
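Equation (2.7) translates almost directly into code; the following NumPy sketch of a single perceptron forward pass uses example weights, bias and threshold that are assumptions made only for illustration.

```python
# Minimal perceptron sketch: weighted sum + bias, activation, threshold (equation 2.7).
import numpy as np

def perceptron(x, w, b, threshold=0.5):
    z = np.dot(w, x) + b                        # weighted sum of the inputs plus the bias
    a = 1.0 / (1.0 + np.exp(-z))                # sigmoid activation
    return 1 if a >= threshold else 0           # fire (1) or not (0)

x = np.array([0.2, 0.7, 0.1])                   # normalized inputs
w = np.array([0.4, 0.9, -0.3])                  # one weight per input
print(perceptron(x, w, b=0.1))
```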

2.2. Multilayer perceptron (MLP)


The multilayer perceptron [12] is, as it sounds, an interconnected combination of perceptrons
forming multiple layers, as shown in figure 2.7.


Figure 2.7: Artificial Neural Network [12].


The first layer is called the input layer. Its number of units is equal to the dimension of the
input attributes, and it is fed directly by the input data. The last layer is the output layer. Its
number of units depends on the type of problem: it is one in regression problems and the
number of classes in classification problems. Between those two layers we find the hidden
layers. That is where most of the magic is done, and the accuracy of the output usually depends
on them. The user can freely choose the number of their units for better results. If the number of
hidden layers is two or above, we are talking about deep learning. If not, it is a
conventional neural network.
It functions just like the perceptron. It is a map from the input to the output with multiple
weights and biases. Their values are initialized at the beginning and change during the training
process until reaching an optimal result. The update is based on an error function, called the loss
function, which compares the results of the model to the ground truth (the true information
given by the user).
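A minimal sketch of such a multilayer perceptron with scikit-learn is given below; the two hidden layer sizes and the toy digits dataset are assumptions made for the example, not the configuration used in this project.

```python
# Minimal MLP sketch: input layer -> two hidden layers -> output layer.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(64, 32),   # two hidden layers -> "deep"
                    activation="relu", max_iter=500, random_state=0)
mlp.fit(X_train, y_train)                          # weights and biases updated from the loss
print("Test accuracy:", mlp.score(X_test, y_test))
```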

2.3. Activation functions


The choice of the activation function [13] is an important task in the perceptron, neural
networks and deep learning. The activation function decides whether the perceptron
should fire or not and determines the result of the output. There are two types of activation
functions: linear, which can be represented by a line, and non-linear, which are represented by
curves. They are differentiable, so we can calculate their derivatives. This allows us to use
the backpropagation optimization technique to update the weights of the model for better
results.
In what follows, here are some examples of the used activation functions:


• Sigmoid: in the equation 2.8, the sigmoid is one of the first functions to be used in neural
networks. It maps the resulting values in a range between zero and one. And for that
reason, it can be interpreted as a firing rate for the neuron with values close to zero
meaning not firing and values close to one as firing.

\( \sigma(x) = \dfrac{1}{1 + e^{-x}} \)  (2.8)

• Rectified Linear Unit (ReLU): given in equation 2.9, ReLU is the most used activation function in
neural networks.
\( f(x) = \max(0, x) \)  (2.9)
• Hyperbolic Tangent: provided in equation 2.10, the hyperbolic tangent function is mostly
used for classification between two classes. The values of the results are mapped between
-1 and 1, and therefore it makes optimization a lot easier than the sigmoid function.

\( \tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \)  (2.10)

• Softmax: provided in equation 2.11, the softmax function is used for multiclass
classification. It calculates the probability of belonging to each possible class, and based on
these probabilities we can determine the target class for a given input.

\( \mathrm{softmax}(x_i) = \dfrac{e^{x_i}}{\sum_{j} e^{x_j}} \)  (2.11)
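The four functions above can be written in a few lines of NumPy; this sketch is only meant to mirror equations 2.8 to 2.11 and is not taken from the project's code.

```python
# Minimal NumPy sketch of the activation functions in equations 2.8-2.11.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))             # maps values into (0, 1)

def relu(x):
    return np.maximum(0, x)                     # keeps positives, zeroes out negatives

def tanh(x):
    return np.tanh(x)                           # maps values into (-1, 1)

def softmax(x):
    e = np.exp(x - np.max(x))                   # shift for numerical stability
    return e / e.sum()                          # probabilities over all classes

z = np.array([1.0, -2.0, 0.5])
print(sigmoid(z), relu(z), tanh(z), softmax(z), sep="\n")
```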

2.4. Loss functions


The loss function [14] is basically a metric that gives an idea of how well the training is going
and how close the model is to the desired output. One of the simplest loss functions is obtained by
subtracting the predicted value from the desired one. But the most used ones are the
following:
• Mean Squared Error: also known as quadratic loss and noted L2-loss. It is mostly used in
regression tasks. Equation 2.12 is the mathematical expression of the mean squared
error function.

\( \mathrm{MSE} = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)  (2.12)

where:
\( y_i \): the desired output,
\( \hat{y}_i \): the predicted output.


• Mean Absolute Error: also noted L1-loss. This one is used to measure the average
magnitude of the error without any consideration of its direction. Equation 2.13 is the
mathematical expression of the mean absolute error function.

\( \mathrm{MAE} = \dfrac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \)  (2.13)

where:
\( y_i \): the desired output,
\( \hat{y}_i \): the predicted output.

• Cross-Entropy: this function is used to measure the performance of a classification model
where the output is a probability between zero and one. The value of the loss increases as
the output diverges from the actual label. Equation 2.14 is the mathematical
expression of the cross-entropy function.

\( \mathrm{CE} = -\sum_{i=1}^{n} y_i \log(\hat{y}_i) \)  (2.14)

where:
\( y_i \): the desired output,
\( \hat{y}_i \): the predicted output.
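Equations 2.12 to 2.14 can likewise be expressed in a few lines of NumPy; the toy one-hot target and predicted probabilities below are assumptions made only for the example.

```python
# Minimal NumPy sketch of the loss functions in equations 2.12-2.14.
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)      # mean squared error (L2 loss)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))     # mean absolute error (L1 loss)

def cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0)          # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([1.0, 0.0, 0.0])              # desired (one-hot) output
y_pred = np.array([0.7, 0.2, 0.1])              # predicted probabilities
print(mse(y_true, y_pred), mae(y_true, y_pred), cross_entropy(y_true, y_pred))
```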

2.5. Neural network training process


To train a neural network, there are four main steps (a minimal sketch of the resulting loop is given after this list):
• Initialization: the model parameters are all initialized with random values.
• Feedforward: it consists of feeding the network with the values of the training set, which
are multiplied by their respective weights and then summed together. The result of
this weighted summation passes through an activation function to determine the output.
• Backpropagation: after calculating the error with the loss function, an optimization
algorithm updates the model parameters backwards. Two of the most used
algorithms for backpropagation are “Adam” and “gradient descent”.
• Iteration: repeat the second and third steps for a certain number of epochs specified by
the user, while continuing to update the model parameters.
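These four steps correspond to the standard training loop sketched below; PyTorch, the toy two-class data and the hyperparameters are assumptions made for illustration and are not prescribed by this report.

```python
# Minimal training-loop sketch: initialization, feedforward, backpropagation, iteration.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(100, 4)                          # toy inputs
y = (X.sum(dim=1) > 0).long()                    # toy binary labels

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))  # random initialization
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(50):                          # iteration over a chosen number of epochs
    logits = model(X)                            # feedforward
    loss = loss_fn(logits, y)                    # compare outputs to the ground truth
    optimizer.zero_grad()
    loss.backward()                              # backpropagation of the error
    optimizer.step()                             # parameter update (Adam)
print("final loss:", loss.item())
```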

3. Recurrent neural networks


Recurrent neural networks [15] are a special kind of neural networks characterized by
having a self-connected node providing them with the aspect of recurrence. In sequential
tasks, the state of the current input depends on the previously provided inputs. So, the
role of recurrent neural networks is to find a relationship between these inputs.
There are two major problems with the use of RNNs, which happen during the training process
when the gradients are propagated back in time:
• The vanishing gradient problem: as the gradients from deeper layers must go through
many matrix multiplications, if their values are small, they begin to shrink
exponentially until they eventually vanish, making the learning process impossible to carry
out.
• The exploding gradient problem: opposite to the vanishing gradient problem, if the values
of the gradients are large, the matrix multiplications make them grow larger and
larger until, at some point, they explode and the model crashes.

3.1. Long Short-Term Memories


LSTMs [16] were born to solve the two gradient problems. With a structure a bit different from
a regular RNN, allowing them to decide which information to remember and which to forget,
they became able to avoid the long-term dependency problem and keep the important information.

Figure 2.8: Internal architecture of an LSTM unit [16].


As shown in figure 2.8, an LSTM cell contains gates. The first gate is called the forget gate, in
which the previous hidden state and the current input pass through a sigmoid function. If the
value returned is close to zero, the data is deemed to be forgotten. If not, it is important and
should be kept. The second gate is the input gate. This gate is responsible for assessing the
importance of the data. The hidden state and the current input pass through another sigmoid
function. If the output is 1, the data is important; if the output is 0, it is not. This
is a necessary step to update the hidden state. After that, they pass into a tanh function to
squish the values between -1 and 1, which regulates the network.


At this point, we need to update the cell state. The previous cell state is multiplied by the
forget vector and then added to the input vector from the input gate. This gives us a new
cell state, more relevant for the training process.
Finally, we find the output gate. The previous hidden state and the current input are passed
through a sigmoid function, and the new cell state passes through the tanh function. Then, by
multiplying the tanh output with the sigmoid output, it can be decided what information the
output should carry.
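To show how such a unit is typically used for text classification, here is a minimal PyTorch sketch of an LSTM-based classifier; the vocabulary size, dimensions and number of classes are assumptions made for the example.

```python
# Minimal LSTM classifier sketch: embeddings -> LSTM -> last hidden state -> class scores.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128, n_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        x = self.embed(token_ids)
        _, (h_n, _) = self.lstm(x)               # h_n: final hidden state of each sequence
        return self.fc(h_n[-1])                  # class scores

model = LSTMClassifier()
dummy_batch = torch.randint(0, 5000, (8, 20))    # 8 sequences of 20 token ids
print(model(dummy_batch).shape)                  # torch.Size([8, 10])
```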

3.2. Gated Recurrent Units


The GRU [17] is quite similar to the LSTM. Both are designed similarly and can actually produce
equally excellent results. However, the GRU gets rid of the cell state and uses the hidden state
to transfer information. A GRU unit has only two gates, the reset gate and the update gate,
explained in figure 2.9.

Figure 2.9: architecture of a GRU cell [17].

• Update gate: it has the same behavior as both the input gate and the forget gate in an
LSTM unit. After a weighted summation of the previous state and the current input, the
result passes into a sigmoid function. According to the result, which is between 0 and 1,
the gate decides what information to use and what to throw away.
• Reset gate: it is another gate used to decide how much past information to drop, by
multiplying the input and the previous state with their corresponding weights, summing the
results and feeding them to the sigmoid function.


4. Transfer Learning and Transformers


4.1. Transfer learning in NLP
Transfer learning [18] is a technique in machine learning and deep learning where you train a
model on a large dataset to acquire certain knowledge that can then be used to train on new,
different, and smaller datasets. We can say that we are transferring domain-dependent or task-
dependent knowledge from the first model to a new one to get better performance and faster
training.

Figure 2.10 summarizes the transfer learning process. The first model is usually trained on
a large unlabeled dataset (Wikipedia articles, news articles…). Technically, the labels are
contained within the text itself (in the context of the sentence) and therefore we cannot call it
supervised learning but self-supervised learning. The model acquires knowledge about
the semantics of the language it was trained on and creates a general understanding of
the language (subject-verb agreement, gender, synonyms…). The second model uses this
knowledge and fine-tunes it on a smaller, usually labeled dataset to achieve a certain task with
better performance and faster training than the usual methods.

Figure 2.10: Transfer Learning Diagram.


The original idea behind transfer learning in NLP was to use the pretrained word embeddings in the
first layer of the new model. This way, however, we lose whatever knowledge was acquired by the
rest of the first model. Today, instead of taking only the first embedding layer, we take the whole
model with all its layers, and we use it for a new task. This means we need to define a pre-
training task to train our big model (which can be called a universal model since it can be
used on multiple tasks). Figure 2.11 illustrates how modern transfer learning works.


Figure 2.11: The three steps of transfer learning in NLP [18].
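In practice, this "pretrain then fine-tune" recipe takes only a few lines with the Hugging Face Transformers library; the checkpoint name, the number of labels and the placeholder texts below are assumptions made only for illustration, not the exact setup of this project.

```python
# Minimal fine-tuning sketch: load a pretrained model, then train it on a small labeled set.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-multilingual-cased"      # hypothetical choice of pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

train_texts = ["example sentence one", "example sentence two"]   # placeholder labeled data
train_labels = torch.tensor([0, 1])

enc = tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):                           # a few fine-tuning epochs
    out = model(**enc, labels=train_labels)      # forward pass returns the loss
    optimizer.zero_grad()
    out.loss.backward()                          # backpropagation through the whole model
    optimizer.step()
print("loss after fine-tuning:", out.loss.item())
```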

4.2. Attention mechanisms


Attention mechanisms [19] are the means of passing information to the decoder. They give an
idea of which words in the input sequence relate most to the word being output, either
because they relate to it as context that gives it meaning, or because they are "cousin" words of
close meaning. Self-attention mechanisms are similar, except that instead of operating
between the elements of the encoder and the decoder, they operate on the elements of the
input among themselves (the present looks at the past and the future) and on the elements of the
output among themselves (the present looks at the past, since the future is yet to be generated).
To introduce attention, let us talk a bit about convolution, because it is often integrated into
RNNs and responds to a similar objective, in that it provides a "context" to a sequence of
words. The input sequence can in fact be passed through one or more convolutional layers
before passing it through the encoder. The convolution products then extract contextual
characteristics between words located near each other, amplify the weight of certain words,
and attenuate the weight of certain other words, and this in a very positional manner.
Convolution allows, layer after layer, the extraction of spatial features (in various areas of the
sentence), which become finer and finer as we dive deeper into the network.
By analogy with convolution on images, to apply convolution to a sentence it is possible, for
example, to put one word per line, each line being the embedding vector of the
corresponding word. Each cell of the resulting table of numbers is then the equivalent
of the light intensity of a pixel in an image.

What comes out of the CNN is information of the type: "words 9, 12 and 24 are very important for the exact meaning of this sentence; moreover, 9 and 12 will need to be combined or correlated; remember that when decoding the sentence." At each step we decide to keep a particular word (thus concentrating the information) which is supposed to be important when generating the next word. That said, this is very positional: the context depends much more on the position of words in the sentence and their position relative to each other than on their semantic similarity.
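Before moving to the full architecture, here is a minimal sketch of the scaled dot-product self-attention described above, assuming a single head, PyTorch tensors and externally supplied projection matrices; multi-head attention and the "key-value" terminology are detailed in the next section.

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """Single-head self-attention over a sequence x of shape (seq_len, d_model)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v          # project the same sequence into queries, keys and values
    scores = q @ k.T / (k.shape[-1] ** 0.5)      # how strongly each position attends to every other position
    weights = F.softmax(scores, dim=-1)          # normalized attention weights
    return weights @ v                           # context-aware representation of each position
```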

4.3. The Transformer architecture


If we had to summarize the Transformer architecture, described in figure 2.12, considering what we have seen previously, we could say that it consists of:
• Six stacked encoders (the Nx in figure 2.12), each encoder taking as input the output of the previous one (except the first, which takes the embeddings as input), followed by six stacked decoders, each taking as input the output of the previous decoder and the output of the last encoder (except the first decoder, which takes as input only the output of the last encoder). Note that the 12 blocks (or 24 depending on the version of BERT) do not share the same weight matrices.
• Each encoder consists of two sublayers: a "multi-head" self-attention layer, followed by a fully connected, position-wise FFN (i.e., each element of the previous layer's output vector is connected to a neuron of the FFN input, in the same order as in the vector). Each sublayer is wrapped in a residual connection (which adds the input values of the sublayer directly to its output) followed by layer normalization.
• Each decoder consists of three sublayers: a "multi-head" self-attention layer, followed by an attention layer over the output of the last encoder, then a fully connected, position-wise FFN. As in the encoder, each sublayer is wrapped in a residual connection followed by layer normalization.
• Three "key-value" attention mechanisms (we will detail "key-value" later): self-attention in the encoder; masked self-attention in the decoder over all the elements already generated (masked because a mask hides the positions that are yet to be generated, as we will see later); and encoder-decoder attention between the element being generated in the decoder and all the elements of the encoder output. Note that the attention layers have several heads (multi-head attention); we will also come back to this in more detail.


• For text generation or translation, the mechanism is auto-regressive: we feed a sequence into the first encoder and the output predicts an element; then the entire sequence goes through the encoder again, together with the element already predicted on the decoder side, in order to generate a second element; then again the sequence through the encoder and all the elements already predicted through the decoder, and so on until an <end of sequence> token is predicted.
• At the output, a probability distribution makes it possible to predict the most probable output element (a PyTorch sketch of this stacked structure is given after figure 2.12).

Figure 2.12: Global architecture of the Transformer [20].
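As an indicative sketch only (not the implementation used in this project), the stacked structure above can be instantiated directly with PyTorch's built-in modules, here with the sizes of the original Transformer (six layers, eight heads, model dimension 512), chosen purely for illustration:

```python
import torch.nn as nn

# Sizes taken from the original Transformer paper, purely for illustration.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)   # the Nx stacked encoders

decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)   # the Nx stacked decoders
```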

5. BERT
BERT [21] stands for Bidirectional Encoder Representations from Transformers. This architecture achieves great results in neural machine translation, question answering, sentiment analysis, text summarization and many other tasks. All these problems require an understanding of language, so we can use BERT to understand the language and then fine-tune it depending on the problem we want to solve. The training of BERT is therefore done in two phases: the first phase is pretraining, in which the model learns what language and context are; the second phase is fine-tuning, in which it learns how to solve the problem.

5.1. Pretraining BERT


In the pretraining phase, BERT learns by training on two unsupervised tasks simultaneously
which are Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

5.1.1. Masked Language Modeling


In masked language modeling [22], BERT takes in a sentence in which random words are replaced with masks. The goal is to predict these masked tokens. This is a kind of fill-in-the-blanks exercise; it helps BERT learn bidirectional context within a sentence.

Figure 2.13: An example of MLM training [22].


In figure 2.13, before passing our tokens into BERT, we masked the "Lincoln" token, replacing it with [MASK].
In technical terms (illustrated in figure 2.14), predicting the masked words requires the following steps, sketched in code below:
• A classification layer is added on top of the encoder output.
• The output vectors are multiplied by the embedding matrix, which transforms them into the vocabulary dimension.
• A softmax is used to calculate the probability of each word in the vocabulary.
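A minimal sketch of those three steps, with hypothetical tensor and layer names, could look like this:

```python
import torch
import torch.nn.functional as F

def mlm_probabilities(encoder_output, dense, embedding_matrix):
    """encoder_output: (seq_len, hidden); dense: classification layer; embedding_matrix: (vocab, hidden)."""
    h = dense(encoder_output)            # classification layer on top of the encoder output
    logits = h @ embedding_matrix.T      # multiply by the embedding matrix: back to vocabulary dimensions
    return F.softmax(logits, dim=-1)     # probability of each vocabulary word at each position
```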


Figure 2.14: The changes added to the encoder for MLM [22].

5.1.2. Next Sentence Prediction


In next sentence prediction [23], BERT takes in two sentences and determines whether the second sentence follows the first. During training, 50% of the inputs are pairs in which the second sentence is the actual subsequent sentence in the original document. For the other 50%, a random sentence from the corpus is chosen as the second sentence, on the assumption that it will be easy to distinguish from a genuine continuation. This helps BERT understand context across sentences.
To help the model distinguish between the sentences during training, and as illustrated in figure 2.15, the input is processed as follows before being fed to the model:
• A [CLS] token is added at the beginning of the first sentence and a [SEP] token at the end of each sentence.
• A sentence embedding (similar in concept to token embeddings, with a vocabulary of 2) indicating whether a token belongs to sentence A or sentence B is added to each token.
• A positional embedding is added to each token, indicating its position in the sequence.

Figure 2.15: BERT input representation [23].
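As a concrete illustration of this input representation, a BERT-style tokenizer from the HuggingFace library builds these ids automatically for a sentence pair; the checkpoint name below is only an example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # any BERT-style tokenizer behaves the same way
enc = tokenizer("The man went to the store.", "He bought a gallon of milk.", return_tensors="pt")
print(enc["input_ids"])        # [CLS] sentence A [SEP] sentence B [SEP]
print(enc["token_type_ids"])   # 0 for sentence A tokens, 1 for sentence B tokens (the sentence embedding id)
```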


To check whether the second sentence is indeed connected to the first, the following steps are performed:
• The entire input sequence is fed to the Transformer model.
• The output of the [CLS] token is transformed into a 2×1 shaped vector using a simple classification layer.
• The probability of IsNextSequence is calculated with a softmax.

5.2. Fine-Tuning BERT


As for the fine-tuning phase, we can now further train BERT on very specific NLP tasks. All we need to do is replace the fully connected output layers of the network with a fresh set of output layers. We can then perform supervised training on a new dataset, and it does not take long since only the output parameters are learned from scratch: most of the hyperparameters stay the same as in the pretraining process, the rest of the model parameters are only slightly fine-tuned, and as a result the training time is short.
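In practice, with the HuggingFace library used later in this report, swapping in a fresh set of output layers amounts to loading the pretrained encoder with a new classification head; the checkpoint name and number of labels below are illustrative only, not the configuration used in this project:

```python
from transformers import AutoModelForSequenceClassification

# Load the pretrained encoder and attach a randomly initialized classification head (3 classes here).
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
# Only the new head starts from scratch; the encoder weights are merely fine-tuned on the labeled data.
```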

5.3. BERT For Arabic Language


Spoken by around 500 million people, Arabic is the most widely spoken member of the Semitic language family. It is the official language of the MENA region, one of the six official UN languages, and the fourth most used language on the internet. Despite its religious, political, and cultural significance, Arabic has received relatively little attention in the modern NLP community. Processing it is difficult because of its many dialects and complex morphology. The informal nature of conversations on social media and the significant difference between dialectal Arabic and Modern Standard Arabic make the job even harder, and the lack of data has left dialectal Arabic understudied.
However, many efforts have been made to solve these issues and to give Arabic the attention it deserves. Researchers and engineers started gathering material from all the Arabic-speaking regions and from many domains in order to build corpora for future projects. These efforts are what helped bring to the world transformers like AraBERT [24]. By gathering Arabic Wikipedia and publicly available news articles from different Arab regions, around 23GB of text (almost 70 million sentences), the developers trained BERT on this data using a TPUv3-8 with 128GB of memory; it took around 5 days to finish training the first BERT for the Arabic language.
Today, we have many choices when it comes to using BERT for Arabic language processing, as shown in table 2.1.


Table 2.1: Different versions of AraBERT

Model               Size (MB/Params)   Pre-Segmentation   Dataset (Sentences/Size/nWords)
AraBERTv0.2-base    543MB / 136M       No                 200M / 77GB / 8.6B
AraBERTv0.2-large   1.38G / 371M       No                 200M / 77GB / 8.6B
AraBERTv2-base      543MB / 136M       Yes                200M / 77GB / 8.6B
AraBERTv2-large     1.38G / 371M       Yes                200M / 77GB / 8.6B
AraBERTv0.1-base    543MB / 136M       No                 77M / 23GB / 2.7B
AraBERTv1-base      543MB / 136M       Yes                77M / 23GB / 2.7B

We also have the models developed by Ali Safaya [25], given in table 2.2.

Table 2.2: Different versions of Ali Safaya's BERT for Arabic language
Model BERT-Mini BERT-Medium BERT-Base BERT-Large
Hidden Layers 4 8 12 24
Attention heads 4 8 12 16
Hidden size 256 512 768 1024
Parameters 11M 48M 110M 340M

Conclusion
In this chapter, we explained what machine learning is and how it works. We then dived into deep learning and the mechanisms that make it function, and went through how transfer learning in general, and Transformers in particular, work. In the next chapter, we will try these methods in practice, examine the results and deduce the best solution for our task.


Chapter 3: Implementation and Experimental Results


Introduction
In this chapter, we introduce the working environment that allowed us to carry out our work and analyze the data we used. We then present the results obtained by the proposed models and deduce from their performance the best way to reach state-of-the-art results.

1. Work Environment
1.1. Hardware Environment
1.1.1. GPU vs CPU
For machine learning in general and deep learning in particular, the most important part of the hardware environment is the graphical processing unit (GPU).
A small neural network might have around 100, 1,000 or even 10,000 units; a normal CPU can still handle their calculations in a matter of minutes or a few hours. But what if the network has millions of parameters? Training could take days or even longer, and the computer might well give up before finishing the task. The GPU is a better fit in this case: it can run many operations simultaneously instead of one after the other, making the training process faster, saving both money and energy, and freeing the CPU for other tasks. Figure 3.1 shows how a GPU can perform roughly ten times more operations per clock cycle than a CPU.

Figure 3.1: Theoretical Peak Floating-Point Operations per Clock Cycle [26].


1.1.2. Google Colaboratory


To promote machine learning research and education, Google launched the Google Colaboratory project, which allows anyone to create and run machine learning applications in Python.
There are many other free cloud platforms for machine learning, but we chose this one for our project since it is very easy to use and provides a GPU far better than the one we have locally. Most of the necessary libraries are already installed, with the possibility of installing others if needed. In addition, all the work is saved in Google Drive, making it easy to access and share.
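As a quick sanity check (not part of the project pipeline, just a common Colab habit), one can verify that the GPU runtime is active before starting any training:

```python
import torch

# In Colab, enable a GPU via Runtime > Change runtime type, then confirm that PyTorch can see it.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device, torch.cuda.get_device_name(0) if torch.cuda.is_available() else "(no GPU detected)")
```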

1.2. Software Environment


1.2.1. Git & Gitlab
GitLab [27] is a DevOps tool that provides a Git repository manager together with wiki, issue-tracking, continuous integration and deployment pipeline features. It offers functionality to automate the DevOps life cycle from planning all the way to deployment and monitoring.
We used Git, the version control system on which GitLab is built, which allows and encourages developers to keep multiple local branches independent of each other. This proved to be useful for teamwork.
1.2.2. Python
Python [28] is a high-level, general-purpose interpreted programming language. It focuses strongly on code readability with its notable use of significant indentation, and it supports both structured and object-oriented programming. Since 2003, Python has consistently ranked among the ten most popular programming languages. It has proven to be more productive than C and Java for programming problems involving string manipulation, and better than Java in terms of memory consumption.
Python is widely used in artificial intelligence projects thanks to its simple syntax, rich text processing tools and the many libraries that facilitate the work, which is what made this language so important in our project.
1.2.3. NumPy
Created by Travis Oliphant in 2005, NumPy [29] is an open-source Python library that adds support for multi-dimensional arrays and matrices, along with a variety of high-level mathematical functions.

40
Automatic Tunisian dialect detection

1.2.4. Pandas
Pandas [30] is the Python library most preferred by data scientists for data analysis and data manipulation. Its fast, expressive, and flexible data structures make real-world data analysis significantly easier. So many functionalities are built into this package that the options can be overwhelming.
1.2.5. Sci-Kit learn
Sci-Kit learn [31] is a free Python machine learning library featuring many algorithms for preprocessing, regression, classification and clustering, designed to work with the NumPy and SciPy libraries.
1.2.6. PyTorch
PyTorch [32] is an open-source machine learning library based on the Torch library, a scientific computing framework and scripting language built on the Lua programming language.
It was primarily developed by Facebook's AI Research lab and is used for various applications such as computer vision and natural language processing. This library provides two key features:
• strong acceleration of tensor computations via graphics processing units;
• deep neural networks built on a tape-based automatic differentiation system.
1.2.7. HuggingFace library
Created by Hugging Face [33], an NLP-focused startup, this library provides state-of-the-art pretrained Transformer architectures such as BERT, ELECTRA, GPT-2 and many more. Its main purpose is to support a variety of NLP tasks such as text classification, information extraction, question answering and text generation.

2. Dataset
2.1. NADI Shared task Dataset
NADI [34], for Nuanced Arabic Dialect Identification, is a shared task that includes two subtasks: country-level identification and province-level identification. The task provides a dataset with two sets of labels, one per subtask; in other words, the same tweets appear in both subtasks but with different labels. In addition, a script was provided to retrieve 10 million unlabeled tweets for optional use.
For data collection, the Twitter API was used to crawl and collect data from 100 provinces belonging to 21 Arab countries. The process took around 10 months. The next step was identifying users who tweeted exclusively from the same province during the whole 10 months, for precision purposes in the labeling process.

2.2. Dataset Analysis


The labeled dataset consists of 30957 tweets of 5 words or more and was split into three sets: a training set with 21000 tweets, a development set with 4957 tweets and a testing set with 5000 tweets. The distribution of tweet locations is presented in table 3.1 and figure 3.2.
Table 3.1: Distribution of country-level dialect identification data across data splits

Country name   Number of Provinces   Train   Dev    Test   Total   Percentage %
Algeria        7                     1491    359    364    2214    7.15
Bahrain        1                     210     8      20     238     0.77
Djibouti       1                     210     10     51     271     0.88
Egypt          21                    4473    1070   1092   6635    21.43
Iraq           12                    2556    636    624    3816    12.33
Jordan         2                     426     104    104    634     2.05
Kuwait         2                     420     70     102    592     1.91
Lebanon        3                     639     110    156    905     2.92
Libya          5                     1070    265    265    1600    5.17
Mauritania     1                     210     40     5      255     0.82
Morocco        5                     1070    249    260    1579    5.10
Oman           6                     1098    249    260    1579    5.22
Palestine      2                     420     102    102    624     2.02
Qatar          2                     234     104    61     399     1.29
Saudi Arabia   10                    2312    579    564    3455    11.16
Somalia        1                     210     51     51     312     1.01
Sudan          1                     210     51     51     312     1.01
Syria          5                     1070    265    260    1595    5.15
Tunisia        4                     750     164    108    1122    3.62
UAE            5                     1070    265    213    1548    5.00
Yemen          4                     851     206    179    1236    3.99
Total          100                   21000   4957   5000   30957   100


Figure 3.2: Distribution of classes for Training and Development sets [34].

We can distinguish several issues in the dataset. First, and most importantly, the labeling method is not ideal: a person in a given country may, during the 10 months of data collection, use dialects other than the one associated with that country, which is the case for recent immigrants, for example. Second, the distribution of the data is unbalanced, as can be seen in figure 3.2. This imbalance has a major impact on training and therefore affects the results of the predictive models. Third, communication on social media is not MSA-free: users can switch between the two varieties, and since MSA is common to all Arabic speakers, it may create confusion in the training process. Another problem is the presence of non-Arabic words in the datasets: some languages such as Farsi are written with Arabic characters, so the API cannot tell the difference and considers them Arabic.

2.3. Preprocessing the data


We wrote a simple preprocessing function that removes links, punctuation, Latin-script words and repeated words, which slightly improved the results.
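The exact cleaning rules used in the project are not reproduced here; the following is only an illustrative reconstruction of that kind of function using regular expressions:

```python
import re

def clean_tweet(text):
    """Illustrative cleaning: links, Latin-script words, punctuation and immediately repeated words."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove links
    text = re.sub(r"[A-Za-z]+", " ", text)          # remove Latin-script words
    text = re.sub(r"[^\w\s]", " ", text)            # remove punctuation (keeps Arabic letters and digits)
    deduped = []
    for word in text.split():                       # drop immediately repeated words
        if not deduped or word != deduped[-1]:
            deduped.append(word)
    return " ".join(deduped)
```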
For the machine learning algorithms, we used TF-IDF, a measure that evaluates how relevant a word is to a document in a collection, together with pipelines to chain multiple estimators into one. This is useful because there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. The pipeline serves two purposes here:
• Convenience: you only have to call fit and predict once on your data to fit a whole sequence of estimators.


• Joint parameter selection: you can grid-search over the parameters of all estimators in the pipeline at once (a minimal pipeline sketch follows).
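As a minimal, self-contained sketch (toy sentences standing in for the NADI tweets, and a linear SVM as the final estimator):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the NADI tweet texts and their country labels.
train_texts = ["برشة مشاكل اليوم", "ازيك عامل ايه", "شلونك اليوم"]
train_labels = ["Tunisia", "Egypt", "Iraq"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),   # turn tweets into TF-IDF feature vectors
    ("svm", LinearSVC()),           # linear SVM on top of those features
])
clf.fit(train_texts, train_labels)  # fit is called once on the whole chain
print(clf.predict(["شلونكم اليوم"]))
```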
For the BERT models, after tokenizing the data with the BERT tokenizer and rearranging the training and development datasets, the data is passed into a PyTorch DataLoader. We pass two variables to BERT's forward function: input_ids and attention_mask. The input_ids are simply the numeric representations of the tokens. The attention_mask is useful when padding is added to the input tokens: it indicates which input_ids correspond to padding. Padding is added to make sure all input sentences have the same length, so that they can form proper tensor objects.
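A minimal sketch of this preparation step, with toy data and an illustrative batch size and maximum sequence length:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer

tweets = ["مثال أول", "مثال ثاني"]   # stand-ins for the NADI tweets
labels = [0, 1]                      # stand-ins for the encoded country labels

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
enc = tokenizer(tweets, padding=True, truncation=True, max_length=128, return_tensors="pt")

dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=16, shuffle=True)   # batches of (input_ids, attention_mask, label)
```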

3. Implementation
3.1. Machine Learning Models
• k-nearest neighbors
We fed the data to this classifier and the results were very poor, with a 0.241 accuracy score and a 0.0709 F1-score. The confusion matrix in figure 3.3 shows that this algorithm is a very bad choice: basically, almost all tweets were classified as Egyptian, since Egyptian data dominates the dataset.

Figure 3.3 : kNN classifier's confusion matrix.


• Random forest classifier


In our problem, Random Forest achieved a 0.3296 accuracy score and a 0.10407 F1-score, which is better than the kNN classifier. The confusion matrix in figure 3.4 shows good results for Egypt, Iraq and Algeria, but performance is still very poor for the rest of the countries.

Figure 3.4 : Random Forest classifier's confusion matrix

• Bernoulli Naive Bayes


This model achieved a 0.14019 F1-score and a 0.344 accuracy score. The scores are clearly better than those of the first two models, but the confusion matrix in figure 3.5 shows that the actual classification behavior did not improve as much.


Figure 3.5 : Bernoulli NB classifier's confusion matrix

• Support Vector Machine


This algorithm is by far the best performing and the most promising, achieving a 0.1696 F1-score and a 0.349 accuracy score. The confusion matrix in figure 3.6 looks good compared with the previous ones: the correct predictions have increased, but we can still find confusion between the Gulf countries, which can be explained by the similarity of their dialects.


Figure 3.6: SVM classifier's confusion matrix

Since the SVM classifier did well, we added a linear kernel to see whether it could improve the results. A linear kernel is used when the data is linearly separable, that is, when it can be separated by a single hyperplane. It is one of the most commonly used kernels, mostly when there are many features in a dataset, as in our case.


There was indeed a minor improvement, as the F1-score became 0.1869 and the accuracy score reached 0.3438. However, there is still confusion when it comes to identifying the dialects of the Gulf countries, as illustrated in figure 3.7.

Figure 3.7: SVM classifier with linear kernel's confusion matrix

3.2. Transformers
Our training process was divided into two phases: first, fine-tuning the language model on the unlabeled tweets, then fine-tuning on 19,950 labeled tweets for the classification task. Due to the limited resources provided by Google Colaboratory for free users, we were not able to use the whole set of 10 million unlabeled tweets: whenever we exceeded 100,000 tweets, the system immediately crashed and the environment restarted, deleting all previous work. To overcome this issue, we kept only 100,000 unlabeled tweets and added both the training and testing data to train the language model. The pretraining of each model took around 8 hours for 10 epochs. When analyzing the loss values, we noticed that after 3 epochs they stay stable for a while and sometimes start growing instead of dropping, meaning that the model starts to overfit. Therefore, the optimal number of epochs is 3, which made the pretraining time shorter (around 4 to 5 hours). After completing the pretraining task, we added a linear layer on top of the embedding model to take care of the classification task. Again, we tested training for 8 epochs, which took around 4 hours; the loss curves made it clear that 4 epochs are enough to avoid overfitting.
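A sketch of what this first phase looks like with the HuggingFace library is given below; the file name, batch size and checkpoint are illustrative, and this is not the exact training script used in the project.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
model = AutoModelForMaskedLM.from_pretrained("asafaya/bert-base-arabic")

# tweets.txt is a placeholder file containing the unlabeled tweets, one per line.
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="tweets.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="lm_tuned", num_train_epochs=3, per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset)
trainer.train()
trainer.save_model("lm_tuned")   # so the classification phase can reload the tuned weights
```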
In this project, we couldn’t use the large models since that training them on GPU takes
forever and Google Collaboratory gives a limited time to use the GPU (around 10 hours).
We have worked with 3 different models from the huggingface transformers library which are
“aubmindlab/bert-base-arabert”, “aubmindlab/bert-base-arabertv2” and “asafaya/bert-
base-arabic”. There were some issues working with these models since the documentation
we found was outdated and a few changes were added. Basically, the debugging part took
most of the time. However, eventually everything worked out and the results were satisfying.
Table 3.2: Transformers performance summary

Model                              Accuracy   Macro recall   Macro precision   Macro F1
aubmindlab/bert-base-arabert       0.3082     0.1776         0.1799            0.2012
aubmindlab/bert-base-arabertv2     0.3862     0.1925         0.1947            0.2107
asafaya/bert-base-arabic           0.3912     0.2211         0.2338            0.2231
aubmindlab/bert-base-arabertv02    0.3322     0.1337         0.1243            0.1216

The best performing model is “asafaya/bert-base-arabic”.
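The second phase, attaching and training the classification layer, can be sketched as a plain PyTorch loop; "lm_tuned" refers to the output of the language-model sketch above, loader to the DataLoader sketched in section 2.3, and the learning rate is illustrative, so this is indicative only rather than the exact script used.

```python
import torch
from transformers import AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Reload the further-pretrained language model and attach a fresh 21-way classification head.
model = AutoModelForSequenceClassification.from_pretrained("lm_tuned", num_labels=21).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # learning rate chosen for illustration

model.train()
for epoch in range(4):                                       # 4 epochs, as chosen from the loss curves
    for input_ids, attention_mask, batch_labels in loader:   # the DataLoader built in section 2.3
        optimizer.zero_grad()
        out = model(input_ids=input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=batch_labels.to(device))
        out.loss.backward()                                   # cross-entropy loss computed by the model
        optimizer.step()
```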

4. Comparative analysis
In this project we built several models to find the best one for the dialect identification task. The comparison in table 3.3 shows the top 6 approaches that helped find the best solution. The non-deep-learning models had roughly similar performances. A BERT model without further fine-tuning of the language model is still quite powerful with only a classification layer added. However, further fine-tuning of the language model on the extra unlabeled tweets is needed for the macro F1-score to exceed 0.20, since these tweets are much closer to the target data than the Wikipedia and news articles the original models were trained on.


Table 3.3: F1 Score of different models on development set and their ranking on the task
Ranking Model F1-score
1 asafaya/bert-base-arabic 0.2231
2 aubmindlab/bert-base-arabertv2 0.2107
3 aubmindlab/bert-base-arabert 0.2012
4 SVM with linear kernel 0.1869
5 SVM 0.1696
6 Bernoulli Naive Bayes 0.1401
Looking at the confusion matrix of BERT with language-model fine-tuning in figure 3.8, we can see the consequences of class imbalance and feature overlap. The most obvious case is the Gulf countries mentioned before: the model predicted most of them as Saudi Arabia and did not predict any Bahrain tweet correctly.

Figure 3.8 : bert-base-arabic’s confusion matrix.


5. Perspective
Even though the benchmarks are satisfying and meet the goal of the internship, there is definitely still room for improvement. The data was unbalanced in its distribution of tweets, which created many problems in the classification task, especially with the machine learning algorithms. In addition, the data preprocessing was not ideal; we feel the data could be cleaned better. The nature of the Arabic language also makes it very hard to deal with: to this day, the Arabic NLP community still lacks the tools needed to handle the language properly, and there is still a lot of research to be done that could help with this task.

Another way to improve the results relies on a better-trained language model. In our project, the language model was only trained on 100,000 unlabeled tweets, which is just 1% of the data we have, and even this 1% made a small difference compared with models without a fine-tuned language model. Using the rest of the unlabeled data could improve our results further, but to do so the training would have to be done on cloud GPUs.

In addition, we only used the base models of BERT. The large versions have more parameters and are trained on larger datasets, which means a large model can perform much better than a base model. However, such models are very slow to train on a single GPU; we would need either several GPUs working in parallel or training on TPUs.

Conclusion
In this chapter, we presented the overall work done in this project and discussed the results achieved with both machine learning and deep learning algorithms. The results were decent, as expected given the limited resources we had; our models could achieve better results if they were trained on a cloud GPU.


General Conclusion
During this internship, I had the opportunity to dive into the field of natural language processing. Having prior experience in machine learning and deep learning for NLP, this project was a perfect opportunity to broaden my knowledge, face new challenges, and get a wider view of the state of the art in this field.

The objective of this work was to design a state-of-the-art classification model for Arabic dialect detection. We were able to implement multiple models for dialect identification using Scikit-learn, PyTorch and the HuggingFace library; this was my first experience with BERT. The results were good and were achieved in three stages: first, we further pre-trained a publicly released BERT model (Arabic-BERT) on 100,000 unlabeled tweets; second, we trained the resulting model on the NADI labeled data for country-level identification multiple times, independently; third, we selected the best performing model (based on its performance on the development set) and compared it to the traditional machine learning results.

There is still room for future improvements and better accuracy: we can change our preprocessing procedure, use the whole unlabeled dataset, use cloud GPUs for faster training, and try the large version of BERT.

There is no doubt that this project is important for our community and that we will benefit a lot from it. It can encourage future research, for example studying the interaction between MSA and dialectal Arabic (DA) in novel ways: questions such as the utility of using DA data to improve MSA regional-use classification systems, and vice versa, can be investigated with various machine learning methods.


References
[1] Documents provided by the host company.
[2] Wang Hua, Ma Cuiqin and Zhou Lijuan - A Brief Review of Machine Learning and its Application - 19 June 2021.
[3] Isha Salian - SuperVize Me: What's the Difference Between Supervised, Unsupervised, Semi-Supervised and Reinforcement Learning? - https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/ - 2 August 2018.
[4] Mar Mbengue - Machine Learning pour débutant : Introduction au Machine Learning - https://penseeartificielle.fr/introduction-au-machine-learning/ - 21 April 2020.
[5] New York Times - Computer Wins on 'Jeopardy!': Trivial, It's Not - https://www.nytimes.com/2011/02/17/science/17jeopardy-watson.html - 17 February 2011.
[6] Charles Manski - "Regression", Journal of Economic Literature - 1 March 1991.
[7] Sidath Asiri - Machine Learning Classifiers - https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623 - 11 June 2018.
[8] Rajvi Shah - Introduction to k-Nearest Neighbors (kNN) Algorithm - https://ai.plainenglish.io/introduction-to-k-nearest-neighbors-knn-algorithm-e8617a448fa8 - 3 March 2021.
[9] J. R. Quinlan - "Simplifying decision trees", International Journal of Man-Machine Studies - 1 September 1987.
[10] Tin Kam Ho - Random Decision Forests - 6 August 2002.
[11] George H. John and Pat Langley - Estimating Continuous Distributions in Bayesian Classifiers - 20 February 2013.
[12] B. Mehlig - Multilayer Perceptrons - 10 February 2021.
[13] A. Sharma V - Activation Functions - 18 July 2016.
[14] R. Parmar - Common Loss Functions - https://towardsdatascience.com/common-loss-functions-in-machinelearning-46af0ffc4d23 - 2 September 2018.
[15] V. S. Bawa - The Basic Architecture of RNN and LSTM - https://pydeeplearning.weebly.com/blog/basic-architecture-of-rnn-and-lstm - 18 January 2017.
[16] Savvas Varsamopoulos - Designing Neural Network-Based Decoders for Surface Codes - November 2018.


[17] S. Kostadinov - Understanding GRU Networks - https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be - 16 December 2017.
[18] Jeremy Howard and Sebastian Ruder - Universal Language Model Fine-tuning for Text Classification - July 2018.
[19] Andrea Galassi, Marco Lippi and Paolo Torroni - Attention in Natural Language Processing - 11 September 2020.
[20] Ashish Vaswani et al. - Attention Is All You Need - 6 December 2017.
[21] Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 24 May 2019.
[22] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu and Tie-Yan Liu - MASS: Masked Sequence to Sequence Pre-training for Language Generation - 21 June 2019.
[23] Wei Shi and Vera Demberg - Next Sentence Prediction helps Implicit Discourse Relation Classification within and across Domains - 3 November 2019.
[24] Wissam Antoun, Fady Baly and Hazem Hajj - AraBERT: Transformer-based Model for Arabic Language Understanding - 7 March 2021.
[25] Ali Safaya - Arabic-BERT - https://github.com/alisafaya/Arabic-BERT - 5 December 2020.
[26] Karl Rupp - FLOPs per Cycle for CPUs, GPUs and Xeon Phis - https://www.karlrupp.net/2016/08/flops-per-cycle-for-cpus-gpus-and-xeon-phis/ - 19 August 2016.
[27] GitLab documentation - https://www.gitlab.com - 4 July 2021.
[28] Python documentation - https://www.python.org - 4 July 2021.
[29] NumPy documentation - https://www.numpy.org - 4 July 2021.
[30] Pandas documentation - https://pandas.pydata.org - 4 July 2021.
[31] Scikit-learn documentation - https://scikit-learn.org - 4 July 2021.
[32] PyTorch documentation - https://pytorch.org - 4 July 2021.
[33] Hugging Face documentation - https://huggingface.co/ - 4 July 2021.
[34] Muhammad Abdul-Mageed, Chiyu Zhang, Houda Bouamor and Nizar Habash - NADI 2020: The First Nuanced Arabic Dialect Identification Shared Task - 9 November 2020.


Appendix

