Text Operation Assignment
By Tariktesfa · 3 min read
Introduction
Amharic, the official language of Ethiopia, boasts a rich linguistic tradition and a unique
script. Preprocessing Amharic language text is a critical step in building effective machine
learning models for various natural language processing tasks. In this article, we will
explore the essential steps involved in preprocessing Amharic datasets to ensure that our
machine learning models can deliver accurate and reliable results. If you’re working on
any NLP task, these preprocessing steps will set you on the right path.
1. Data Cleaning
The first step in preprocessing an Amharic dataset is data cleaning. This involves
removing noisy or irrelevant data, such as HTML tags, special characters, or non-text
elements, which can interfere with the quality of the dataset.
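As a rough sketch (assuming the noise consists of HTML tags, URLs, and stray invisible characters, which will vary with your data source), cleaning might look like this:

import re

def clean_amharic_text(raw_text):
    """Remove common noise from scraped Amharic text (illustrative, not exhaustive)."""
    text = re.sub(r'<[^>]+>', ' ', raw_text)                # strip HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)      # strip URLs
    text = re.sub(r'[\u200b\u200c\u200d\ufeff]', '', text)  # strip zero-width/invisible characters
    text = re.sub(r'\s+', ' ', text).strip()                 # collapse whitespace
    return text

print(clean_amharic_text('<p>ሰላም ዓለም! https://example.com</p>'))  # ሰላም ዓለም!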
2. Text Normalization
Amharic is written in the Ge'ez script, whose characters are known as "Fidels." Several Fidels represent the same sound but have different written forms. Text normalization in Amharic therefore involves converting these homophone characters into one consistent form: we replace the various letters that share a pronunciation with a single representative character, which ensures uniformity in the text.
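For illustration, here is a minimal sketch using only a handful of the homophone mappings (the full rule set appears in the normalize_char_level_missmatch function reproduced in the appendix below):

import re

# Illustrative subset of homophone mappings; the complete rule set is shown in the appendix code.
HOMOPHONE_MAP = [('[ሃኅኃሐሓኻ]', 'ሀ'), ('[ሠ]', 'ሰ'), ('[ዓኣዐ]', 'አ'), ('[ጸ]', 'ፀ')]

def normalize_fidel(token):
    for pattern, replacement in HOMOPHONE_MAP:
        token = re.sub(pattern, replacement, token)
    return token

print(normalize_fidel('ፀሐይ'))  # ፀሀይ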
3. Tokenization
Tokenization is the process of breaking Amharic text into smaller units, such as words or
subword tokens. It is crucial to segment the text effectively for further processing.
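A simple whitespace-and-punctuation tokenizer, assuming sentences are delimited by the Ethiopic full stop (።) and the usual question and exclamation marks:

import re

def tokenize_amharic(corpus):
    tokens = []
    # split into sentences on sentence-ending punctuation, then on whitespace
    for sentence in re.split('[!?።]+', corpus):
        tokens.extend(sentence.split())
    return tokens

print(tokenize_amharic('ሰላም ዓለም። እንዴት ነህ?'))  # ['ሰላም', 'ዓለም', 'እንዴት', 'ነህ']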
4. Stop Word Removal
Stop words are commonly used words that are often filtered out in natural language processing tasks because they carry little task-specific meaning. Amharic, like any other language, has such words, so it is important to build (or reuse) an Amharic stop word list and remove those words from the text to improve the dataset’s quality.
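A small sketch, assuming you have saved one of the stop word lists referenced at the end of this article as a local file named stopwords_am.txt (the filename is just a placeholder):

# stopwords_am.txt is an assumed local copy of a one-word-per-line Amharic stop word list
with open('stopwords_am.txt', encoding='utf8') as f:
    STOP_WORDS = {line.strip() for line in f if line.strip()}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]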
5. Stemming or Lemmatization
Stemming or lemmatization can be applied to reduce Amharic words to their base forms,
simplifying the vocabulary. However, this step can be challenging due to Amharic’s rich
morphology and the limited availability of resources and tools. It may require input from
language experts.
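In the absence of a mature Amharic stemmer, one pragmatic option is light rule-based suffix stripping; the suffix list below is only an illustrative subset (for example, the plural markers), not a real morphological analysis:

# Illustrative suffixes only (e.g. the plural marker -ዎች); a real stemmer needs far more rules
# and should be validated with language experts.
SUFFIXES = ['ዎች', 'ኦች']

def light_stem(token):
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[:-len(suffix)]
    return token

print(light_stem('ተማሪዎች'))  # ተማሪ ("students" -> "student")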
6. Feature Representation
To use Amharic text as input for machine learning models, we first need to convert it into numerical representations such as one-hot encoding, TF-IDF, bag-of-words (BoW), or word embeddings.
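For example, scikit-learn's TfidfVectorizer can be combined with an Amharic-aware tokenizer; a minimal sketch, assuming the text has already been cleaned and normalized as described above:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['ሰላም ዓለም', 'ሰላም ኢትዮጵያ']  # toy corpus, already cleaned and normalized

vectorizer = TfidfVectorizer(
    tokenizer=str.split,   # plug in your own Amharic tokenizer (whitespace split here for illustration)
    token_pattern=None,    # disable the default token pattern since we supply a tokenizer
    lowercase=False,       # the Amharic script has no case
)
X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())  # vocabulary of Amharic tokens
print(X.shape)                             # (number_of_documents, vocabulary_size)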
7. Data Augmentation
Depending on the dataset size, we can consider data augmentation techniques to increase diversity and improve model generalization.
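One simple, language-agnostic example is random token deletion, in the spirit of "easy data augmentation"; treat the sketch below as illustrative rather than a recommended recipe:

import random

def augment_random_deletion(tokens, p=0.1, seed=None):
    """Randomly drop tokens with probability p to create a noisy variant of a sentence."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else tokens  # never return an empty sentence

original = ['ሰላም', 'ዓለም', 'እንዴት', 'ናችሁ']
print(augment_random_deletion(original, p=0.3, seed=42))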
8. Exploratory Data Analysis
Exploratory data analysis helps us gain insights into the dataset, such as the text length distribution and the most common words, which in turn informs preprocessing decisions.
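A short sketch of such checks over an already-cleaned corpus:

from collections import Counter

documents = ['ሰላም ዓለም', 'ሰላም ኢትዮጵያ', 'እንዴት ናችሁ']  # assumed cleaned corpus

lengths = [len(doc.split()) for doc in documents]
print('average tokens per document:', sum(lengths) / len(lengths))
print('min/max length:', min(lengths), max(lengths))

token_counts = Counter(token for doc in documents for token in doc.split())
print('most common tokens:', token_counts.most_common(5))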
It’s important to note that the specific preprocessing steps can vary based on the
characteristics of the Amharic dataset. Understanding the nature of the data is key to
adapting the preprocessing steps and maximizing the performance of the machine
learning models.
For further exploration of Amharic dataset preprocessing, you can refer to the following resources:
winlp2021_54_Paper.pdf
Text Preprocessing for Amharic | Data Science Projects (abe2g.github.io)
Abe2G/Amharic-Simple-Text-Preprocessing-Usin-Python: Amharic text preprocessing (github.com)
stopwords-am/stopwords-am.txt at main · geeztypes/stopwords-am (github.com)
irit.fr/AmharicResources/wp-content/uploads/2021/03/StopWord-list.txt
Feel free to adapt these steps to your specific project needs, adding or removing steps as required. Preprocessing Amharic text presents real challenges, but with the right techniques and resources, we can build reliable NLP applications in this language.
Appendix: Code from the Referenced Resources
The excerpts below are taken from "Text Preprocessing for Amharic | Data Science Projects" (abe2g.github.io) and the Abe2G/Amharic-Simple-Text-Preprocessing-Usin-Python repository by Abebawu Eshetu (research interests: Natural Language Processing, Machine Learning, and Computer Vision for Social Good).
def get_short_forms(self):
    text=open(self.expansion_file_dir,encoding='utf8')
    exp={}
    for line in iter(text):
        line=line.strip()
        if not line: # line is blank
            continue
        else:
            expanded=line.split("-")
            exp[expanded[0].strip()]=expanded[1].replace(" ",'_').strip()
    return exp
def normalize_char_level_missmatch(input_token):
    rep1=re.sub('[ሃኅኃሐሓኻ]','ሀ',input_token)
    rep2=re.sub('[ሑኁዅ]','ሁ',rep1)
    rep3=re.sub('[ኂሒኺ]','ሂ',rep2)
    rep4=re.sub('[ኌሔዄ]','ሄ',rep3)
    rep5=re.sub('[ሕኅ]','ህ',rep4)
    rep6=re.sub('[ኆሖኾ]','ሆ',rep5)
    rep7=re.sub('[ሠ]','ሰ',rep6)
    rep8=re.sub('[ሡ]','ሱ',rep7)
    rep9=re.sub('[ሢ]','ሲ',rep8)
    rep10=re.sub('[ሣ]','ሳ',rep9)
    rep11=re.sub('[ሤ]','ሴ',rep10)
    rep12=re.sub('[ሥ]','ስ',rep11)
    rep13=re.sub('[ሦ]','ሶ',rep12)
    rep14=re.sub('[ዓኣዐ]','አ',rep13)
    rep15=re.sub('[ዑ]','ኡ',rep14)
    rep16=re.sub('[ዒ]','ኢ',rep15)
    rep17=re.sub('[ዔ]','ኤ',rep16)
    rep18=re.sub('[ዕ]','እ',rep17)
    rep19=re.sub('[ዖ]','ኦ',rep18)
    rep20=re.sub('[ጸ]','ፀ',rep19)
    rep21=re.sub('[ጹ]','ፁ',rep20)
    rep22=re.sub('[ጺ]','ፂ',rep21)
    rep23=re.sub('[ጻ]','ፃ',rep22)
    rep24=re.sub('[ጼ]','ፄ',rep23)
    rep25=re.sub('[ጽ]','ፅ',rep24)
    rep26=re.sub('[ጾ]','ፆ',rep25)
    # normalizing words with labialized Amharic characters such as በልቱዋል or በልቱአል to በልቷል
    rep27=re.sub('(ሉ[ዋአ])','ሏ',rep26)
    rep28=re.sub('(ሙ[ዋአ])','ሟ',rep27)
    rep29=re.sub('(ቱ[ዋአ])','ቷ',rep28)
    rep30=re.sub('(ሩ[ዋአ])','ሯ',rep29)
    rep31=re.sub('(ሱ[ዋአ])','ሷ',rep30)
    rep32=re.sub('(ሹ[ዋአ])','ሿ',rep31)
    rep33=re.sub('(ቁ[ዋአ])','ቋ',rep32)
    rep34=re.sub('(ቡ[ዋአ])','ቧ',rep33)
    rep35=re.sub('(ቹ[ዋአ])','ቿ',rep34)
    rep36=re.sub('(ሁ[ዋአ])','ኋ',rep35)
    rep37=re.sub('(ኑ[ዋአ])','ኗ',rep36)
    rep38=re.sub('(ኙ[ዋአ])','ኟ',rep37)
    rep39=re.sub('(ኩ[ዋአ])','ኳ',rep38)
    rep40=re.sub('(ዙ[ዋአ])','ዟ',rep39)
    rep41=re.sub('(ጉ[ዋአ])','ጓ',rep40)
    rep42=re.sub('(ደ[ዋአ])','ዷ',rep41)
    rep43=re.sub('(ጡ[ዋአ])','ጧ',rep42)
    rep44=re.sub('(ጩ[ዋአ])','ጯ',rep43)
    rep45=re.sub('(ጹ[ዋአ])','ጿ',rep44)
    rep46=re.sub('(ፉ[ዋአ])','ፏ',rep45)
    rep47=re.sub('[ቊ]','ቁ',rep46) # ቁ can also be written as ቊ
    rep48=re.sub('[ኵ]','ኩ',rep47) # ኩ can also be written as ኵ
    return rep48
def remove_punc_and_special_chars(text):
    normalized_text = re.sub('[\!\@\#\$\%\^\«\»\&\*\(\)\…\[\]\{\}\;\“\”\›\’\‘\"\'\:\,\.\‹\/\<\>\?\\\\|\`\´\~\-\=\+\፡\።\፤\;\፦\፥\፧\፨\፠\፣]', '',text)
    return normalized_text

# remove all ASCII characters and Arabic and Amharic numerals
def remove_ascii_and_numbers(text_input):
    rm_num_and_ascii=re.sub('[A-Za-z0-9]','',text_input)
    return re.sub('[\'\u1369-\u137C\']+','',rm_num_and_ascii)
The first process in this component forms all possible bi-grams from the tokenized input text. Next, a chi-square score is computed to detect multi-word expressions: bigrams whose chi-square value exceeds an experimentally chosen threshold are kept as multi-words.
from nltk import BigramCollocationFinder
import nltk.collocations
import io
import re
import os

class normalize(object):
    def tokenize(self,corpus):
        print('Tokenization ...')
        all_tokens=[]
        sentences=re.compile('[!?።(\፡\፡)]+').split(corpus)
        for sentence in sentences:
            tokens=sentence.split() # expecting non-sentence identifiers are already removed
            all_tokens.extend(tokens)
        return all_tokens

    def collocation_finder(self,tokens,bigram_dir):
        bigram_measures = nltk.collocations.BigramAssocMeasures()
        # search for bigrams within the corpus
        finder = BigramCollocationFinder.from_words(tokens)
        finder.apply_freq_filter(3)
        frequent_bigrams = finder.nbest(bigram_measures.chi_sq,5) # chi-square computation
        print(frequent_bigrams)
        PhraseWriter = io.open(bigram_dir, "w", encoding="utf8")
        # persist the detected bigrams one per line so normalize_multi_words can reload them
        for first,second in frequent_bigrams:
            PhraseWriter.write(first+' '+second+'\n')
        PhraseWriter.close()

    def normalize_multi_words(self,tokenized_sentence,bigram_dir,corpus):
        # bigram_dir: the file used to store detected multi-words
        bigram=set()
        sent_with_bigrams=[]
        index=0
        if not os.path.exists(bigram_dir):
            self.collocation_finder(self.tokenize(corpus),bigram_dir)
            # call itself again now that the bigram file exists
            return self.normalize_multi_words(tokenized_sentence,bigram_dir,corpus)
        else:
            text=open(bigram_dir,encoding='utf8')
            for line in iter(text):
                line=line.strip()
                if not line: # line is blank
                    continue
                else:
                    bigram.add(line)
            if len(tokenized_sentence)==1:
                sent_with_bigrams=tokenized_sentence
            else:
                while index <=len(tokenized_sentence)-2:
                    mword=tokenized_sentence[index]+' '+tokenized_sentence[index+1]
                    if mword in bigram:
                        sent_with_bigrams.append(tokenized_sentence[index]+''+tokenized_sentence[index+1])
                        index+=1
                    else:
                        sent_with_bigrams.append(tokenized_sentence[index])
                        index+=1
                if index==len(tokenized_sentence)-1:
                    sent_with_bigrams.append(tokenized_sentence[index])
        return sent_with_bigrams
def arabic2geez(arabicNumber):
    ETHIOPIC_ONE= 0x1369
    ETHIOPIC_TEN= 0x1372
    ETHIOPIC_HUNDRED= 0x137B
    ETHIOPIC_TEN_THOUSAND = 0x137C
    arabicNumber=str(arabicNumber)
    n = len(arabicNumber)-1 # length of the Arabic number
    if n%2 == 0:
        arabicNumber = "0" + arabicNumber
        n+=1
    arabicBigrams=[arabicNumber[i:i+2] for i in range(0,n,2)] # splitting into bigrams
    reversedArabic=arabicBigrams[::-1] # reversing list content
    geez=[]
    for index,pair in enumerate(reversedArabic):
        curr_geez=''
        artens=pair[0] # Arabic tens
        arones=pair[1] # Arabic ones
        amtens=''
        amones=''
        if artens!='0':
            amtens=str(chr((int(artens) + (ETHIOPIC_TEN - 1)))) # replacing with Geez tens [፲,፳,፴, ...]
        else:
            if arones=='0': # for 00 cases
                continue
        if arones!='0':
            amones=str(chr((int(arones) + (ETHIOPIC_ONE - 1)))) # replacing with Geez ones [፩,፪,፫, ...]
        if index>0:
            if index%2!= 0: # odd index
                curr_geez=amtens+amones+ str(chr(ETHIOPIC_HUNDRED)) # appending ፻
            else: # even index
                curr_geez=amtens+amones+ str(chr(ETHIOPIC_TEN_THOUSAND)) # appending ፼
        else: # last bigram (right-most part)
            curr_geez=amtens+amones
        geez.append(curr_geez)
    geez=''.join(geez[::-1])
    if geez.startswith('፩፻') or geez.startswith('፩፼'):
        geez=geez[1:]
    if len(arabicNumber)>=7:
        end_zeros=''.join(re.findall('([0]+)$',arabicNumber)[0:])
        i=int(len(end_zeros)/3)
        if len(end_zeros)>=(3*i):
            if i>=3:
                i-=1
            for thoushand in range(i-1):
                print(thoushand)
                geez+='፼'
    return geez

def getExpandedNumber(self,number):
    if '.' not in str(number):
        return arabic2geez(number)
    else:
        num,decimal=str(number).split('.')
        if decimal.startswith('0'):
            decimal=decimal[1:]
            dot=' ነጥብ ዜሮ '
        else:
            dot=' ነጥብ '
        return arabic2geez(num)+dot+self.arabic2geez(decimal)
Code from the Abe2G/Amharic-Simple-Text-Preprocessing-Usin-Python repository README:
class normalize(object):
    expansion_file_dir='' # assume you have a file listing short forms and their expansions as a gazetteer
    short_form_dict={}

    # Constructor
    def __init__(self):
        self.short_form_dict=self.get_short_forms()

    def get_short_forms(self):
        text=open(self.expansion_file_dir,encoding='utf8')
        exp={}
        for line in iter(text):
            line=line.strip()
            if not line: # line is blank
                continue
            else:
                expanded=line.split("-")
                exp[expanded[0].strip()]=expanded[1].replace(" ",'_').strip()
        return exp

    def expand_short_form(self,input_short_word):
        if input_short_word in self.short_form_dict:
            return self.short_form_dict[input_short_word]
        else:
            return input_short_word
    # normalize_char_level_missmatch, remove_punc_and_special_chars and remove_ascii_and_numbers
    # are identical to the standalone versions shown earlier in this appendix.
BASE_DIR = '../'
DATA_DIR = BASE_DIR+'Dataset/'
MODEL_DIR='Models/'
EMBED_DIR=MODEL_DIR+'Embedding/'
PREPROCESSED_DIR=DATA_DIR +'normalized/'
import nltk.collocations
from nltk import BigramCollocationFinder
import io
import re
import os

class normalize(object):
    def tokenize(self,corpus):
        print('Tokenization ...')
        all_tokens=[]
        sentences=re.compile('[!?።(\፡\፡)]+').split(corpus)
        for sentence in sentences:
            tokens=re.compile('[\s+]+').split(sentence)
            all_tokens.extend(tokens)
        return all_tokens

    def get_short_forms(self,_file_dir):
        text=open(_file_dir,encoding='utf8')
        exp={}
        for line in iter(text):
            line=line.strip()
            if not line: # line is blank
                continue
            else:
                expanded=line.split("-")
                exp[expanded[0].strip()]=expanded[1].replace(" ",'_').strip()
        return exp

    def collocation_finder(self,tokens,bigram_dir):
        bigram_measures = nltk.collocations.BigramAssocMeasures()
        # search for bigrams within a corpus
        finder = BigramCollocationFinder.from_words(tokens)
        finder.apply_freq_filter(3)
        frequent_bigrams = finder.nbest(bigram_measures.chi_sq,5) # chi-square computation
        print(frequent_bigrams)
        PhraseWriter = io.open(bigram_dir, "w", encoding="utf8")
        # persist the detected bigrams one per line so normalize_multi_words can reload them
        for first,second in frequent_bigrams:
            PhraseWriter.write(first+' '+second+'\n')
        PhraseWriter.close()

    def normalize_multi_words(self,tokenized_sentence,bigram_dir,corpus):
        bigram=set()
        sent_with_bigrams=[]
        index=0
        if not os.path.exists(bigram_dir):
            self.collocation_finder(self.tokenize(corpus),bigram_dir)
            # call itself again now that the bigram file exists
            return self.normalize_multi_words(tokenized_sentence,bigram_dir,corpus)
        else:
            text=open(bigram_dir,encoding='utf8')
            for line in iter(text):
                line=line.strip()
                if not line: # line is blank
                    continue
                else:
                    bigram.add(line)
            if len(tokenized_sentence)==1:
                sent_with_bigrams=tokenized_sentence
            else:
                while index<=len(tokenized_sentence)-2:
                    mword=tokenized_sentence[index]+' '+tokenized_sentence[index+1]
                    if mword in bigram:
                        sent_with_bigrams.append(tokenized_sentence[index]+''+tokenized_sentence[index+1])
                        index+=1
                    else:
                        sent_with_bigrams.append(tokenized_sentence[index])
                        index+=1
                if index==len(tokenized_sentence)-1:
                    sent_with_bigrams.append(tokenized_sentence[index])
        return sent_with_bigrams

    def expand_short_form(self,input_short_word,_file_dir):
        if not os.path.exists(_file_dir):
            return input_short_word
        else:
            short_form_dict=self.get_short_forms(_file_dir)
            if input_short_word in short_form_dict:
                return short_form_dict[input_short_word]
            else:
                return input_short_word

    def normalize_char_level_missmatch(self,input_token,lang_resource):
        # lang_resource: file listing "source_characters target_character" pairs, one rule per line
        if not os.path.exists(lang_resource):
            return input_token
        else:
            text=open(lang_resource,encoding='utf8')
            rep=input_token
            for line in iter(text):
                line=line.strip()
                if not line: # line is blank
                    continue
                else:
                    chars=line.split()
                    chars_from=chars[0]
                    chars_to=chars[1]
                    rep=re.sub('['+chars_from+']',chars_to,rep)
            return rep

    def remove_punc_and_special_chars(self,sentence_input,lang_resource):
        # lang_resource: file listing the punctuation/special characters to strip, separated by spaces
        if not os.path.exists(lang_resource):
            return sentence_input
        else:
            text=open(lang_resource,encoding='utf8')
            chars=text.read()
            sp_chars=chars.split(' ')
            punct=set(sp_chars)
            normalized_text=sentence_input
            for p in punct:
                normalized_text=normalized_text.replace(p,'') # drop each listed character
            return normalized_text

    def preprocess_text(self,text_input,model_dir,corpus):
        normalzed_text=[]
        CHARS_DIR=model_dir+DirConfig.CHARS_DIR
        MULTI_DIR=model_dir+DirConfig.MULTI_DIR
        ABRV_DIR=model_dir+DirConfig.ABRV_DIR
        PUNCT_DIR=model_dir+DirConfig.PUNCT_DIR
        for sentence in text_input:
            tokens=re.compile('[\s+]+').split(sentence)
            normalized_token=[]
            multi_words=self.normalize_multi_words(tokens,MULTI_DIR,corpus)
            for token in multi_words:
                short_rem=self.expand_short_form(token,ABRV_DIR)
                char_normalized=self.normalize_char_level_missmatch(short_rem,CHARS_DIR)
                punct_rem=self.remove_punc_and_special_chars(char_normalized,PUNCT_DIR)
                normalized_token.append(punct_rem)
            normalzed_text.append(normalized_token)
        return normalzed_text
# arabic2geez and getExpandedNumber are identical to the versions shown earlier in this appendix.