Text Operation Assignment
By Tariktesfa · 3 min read
Introduction
Amharic, the official language of Ethiopia, boasts a rich linguistic tradition and a unique
script. Preprocessing Amharic language text is a critical step in building effective machine
learning models for various natural language processing tasks. In this article, we will
explore the essential steps involved in preprocessing Amharic datasets to ensure that our
machine learning models can deliver accurate and reliable results. If you’re working on
any NLP task, these preprocessing steps will set you on the right path.
1. Data Cleaning
The first step in preprocessing an Amharic dataset is data cleaning. This involves
removing noisy or irrelevant data, such as HTML tags, special characters, or non-text
elements, which can interfere with the quality of the dataset.
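As a rough sketch (assuming the noise consists of HTML tags, URLs, and stray invisible characters, which will vary with your data source), cleaning might look like this:

import re

def clean_amharic_text(raw_text):
    """Remove common noise from scraped Amharic text (illustrative, not exhaustive)."""
    text = re.sub(r'<[^>]+>', ' ', raw_text)                # strip HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)      # strip URLs
    text = re.sub(r'[\u200b\u200c\u200d\ufeff]', '', text)  # strip zero-width/invisible characters
    text = re.sub(r'\s+', ' ', text).strip()                 # collapse whitespace
    return text

print(clean_amharic_text('<p>ሰላም ዓለም! https://example.com</p>'))  # ሰላም ዓለም!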
2. Text Normalization
Amharic is written in the Ge'ez script, whose characters are known as "Fidels." Several Fidels represent the same sound but have different written forms. Text normalization in Amharic therefore involves converting these homophone characters into one consistent form: we replace the various letters that share a pronunciation with a single representative character, which ensures uniformity in the text.
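For illustration, here is a minimal sketch using only a handful of the homophone mappings (the full rule set appears in the normalize_char_level_missmatch function reproduced in the appendix below):

import re

# Illustrative subset of homophone mappings; the complete rule set is shown in the appendix code.
HOMOPHONE_MAP = [('[ሃኅኃሐሓኻ]', 'ሀ'), ('[ሠ]', 'ሰ'), ('[ዓኣዐ]', 'አ'), ('[ጸ]', 'ፀ')]

def normalize_fidel(token):
    for pattern, replacement in HOMOPHONE_MAP:
        token = re.sub(pattern, replacement, token)
    return token

print(normalize_fidel('ፀሐይ'))  # ፀሀይ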
3. Tokenization
Tokenization is the process of breaking Amharic text into smaller units, such as words or
subword tokens. It is crucial to segment the text effectively for further processing.
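A simple whitespace-and-punctuation tokenizer, assuming sentences are delimited by the Ethiopic full stop (።) and the usual question and exclamation marks:

import re

def tokenize_amharic(corpus):
    tokens = []
    # split into sentences on sentence-ending punctuation, then on whitespace
    for sentence in re.split('[!?።]+', corpus):
        tokens.extend(sentence.split())
    return tokens

print(tokenize_amharic('ሰላም ዓለም። እንዴት ነህ?'))  # ['ሰላም', 'ዓለም', 'እንዴት', 'ነህ']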
4. Stop Word Removal
Stop words are commonly used words that are often filtered out in natural language processing tasks because they carry little task-specific meaning. Amharic, like any other language, has such words, so it is important to build (or reuse) an Amharic stop word list and remove those words from the text to improve the dataset’s quality.
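A small sketch, assuming you have saved one of the stop word lists referenced at the end of this article as a local file named stopwords_am.txt (the filename is just a placeholder):

# stopwords_am.txt is an assumed local copy of a one-word-per-line Amharic stop word list
with open('stopwords_am.txt', encoding='utf8') as f:
    STOP_WORDS = {line.strip() for line in f if line.strip()}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]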
5. Stemming or Lemmatization
Stemming or lemmatization can be applied to reduce Amharic words to their base forms,
simplifying the vocabulary. However, this step can be challenging due to Amharic’s rich
morphology and the limited availability of resources and tools. It may require input from
language experts.
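In the absence of a mature Amharic stemmer, one pragmatic option is light rule-based suffix stripping; the suffix list below is only an illustrative subset (for example, the plural markers), not a real morphological analysis:

# Illustrative suffixes only (e.g. the plural marker -ዎች); a real stemmer needs far more rules
# and should be validated with language experts.
SUFFIXES = ['ዎች', 'ኦች']

def light_stem(token):
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[:-len(suffix)]
    return token

print(light_stem('ተማሪዎች'))  # ተማሪ ("students" -> "student")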
6. Feature Representation
To use Amharic text as input for machine learning models, we first need to convert it into numerical representations such as one-hot encoding, TF-IDF, bag-of-words (BoW), or word embeddings.
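For example, scikit-learn's TfidfVectorizer can be combined with an Amharic-aware tokenizer; a minimal sketch, assuming the text has already been cleaned and normalized as described above:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['ሰላም ዓለም', 'ሰላም ኢትዮጵያ']  # toy corpus, already cleaned and normalized

vectorizer = TfidfVectorizer(
    tokenizer=str.split,   # plug in your own Amharic tokenizer (whitespace split here for illustration)
    token_pattern=None,    # disable the default token pattern since we supply a tokenizer
    lowercase=False,       # the Amharic script has no case
)
X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())  # vocabulary of Amharic tokens
print(X.shape)                             # (number_of_documents, vocabulary_size)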
7. Data Augmentation
Depending on the dataset size, we can consider data augmentation techniques to increase diversity and improve model generalization.
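One simple, language-agnostic example is random token deletion, in the spirit of "easy data augmentation"; treat the sketch below as illustrative rather than a recommended recipe:

import random

def augment_random_deletion(tokens, p=0.1, seed=None):
    """Randomly drop tokens with probability p to create a noisy variant of a sentence."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else tokens  # never return an empty sentence

original = ['ሰላም', 'ዓለም', 'እንዴት', 'ናችሁ']
print(augment_random_deletion(original, p=0.3, seed=42))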
8. Exploratory Data Analysis
Exploratory data analysis helps us gain insights into the dataset, such as the text length distribution and the most common words, which in turn informs preprocessing decisions.
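A short sketch of such checks over an already-cleaned corpus:

from collections import Counter

documents = ['ሰላም ዓለም', 'ሰላም ኢትዮጵያ', 'እንዴት ናችሁ']  # assumed cleaned corpus

lengths = [len(doc.split()) for doc in documents]
print('average tokens per document:', sum(lengths) / len(lengths))
print('min/max length:', min(lengths), max(lengths))

token_counts = Counter(token for doc in documents for token in doc.split())
print('most common tokens:', token_counts.most_common(5))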
It’s important to note that the specific preprocessing steps can vary based on the
characteristics of the Amharic dataset. Understanding the nature of the data is key to
adapting the preprocessing steps and maximizing the performance of the machine
learning models.
For further exploration of Amharic dataset preprocessing, you can refer to the following resources:
winlp2021_54_Paper.pdf
Text Preprocessing for Amharic | Data Science Projects (abe2g.github.io)
Abe2G/Amharic-Simple-Text-Preprocessing-Usin-Python: Amharic text preprocessing (github.com)
stopwords-am/stopwords-am.txt at main · geeztypes/stopwords-am (github.com)
irit.fr/AmharicResources/wp-content/uploads/2021/03/StopWord-list.txt
Feel free to adapt these steps to your specific project needs, adding or removing steps as required. Preprocessing Amharic text presents real challenges, but with the right techniques and resources, we can build reliable NLP applications in this language.
Appendix: Code from the Referenced Resources
The excerpts below are taken from "Text Preprocessing for Amharic | Data Science Projects" (abe2g.github.io) and the Abe2G/Amharic-Simple-Text-Preprocessing-Usin-Python repository by Abebawu Eshetu (research interests: Natural Language Processing, Machine Learning, and Computer Vision for Social Good).
def get_short_forms(self):
    text=open(self.expansion_file_dir,encoding='utf8')
    exp={}
    for line in iter(text):
        line=line.strip()
        if not line: # line is blank
            continue
        else:
            expanded=line.split("-")
            exp[expanded[0].strip()]=expanded[1].replace(" ",'_').strip()
    return exp
def normalize_char_level_missmatch(input_token):
    rep1=re.sub('[ሃኅኃሐሓኻ]','ሀ',input_token)
    rep2=re.sub('[ሑኁዅ]','ሁ',rep1)
    rep3=re.sub('[ኂሒኺ]','ሂ',rep2)
    rep4=re.sub('[ኌሔዄ]','ሄ',rep3)
    rep5=re.sub('[ሕኅ]','ህ',rep4)
    rep6=re.sub('[ኆሖኾ]','ሆ',rep5)
    rep7=re.sub('[ሠ]','ሰ',rep6)
    rep8=re.sub('[ሡ]','ሱ',rep7)
    rep9=re.sub('[ሢ]','ሲ',rep8)
    rep10=re.sub('[ሣ]','ሳ',rep9)
    rep11=re.sub('[ሤ]','ሴ',rep10)
    rep12=re.sub('[ሥ]','ስ',rep11)
    rep13=re.sub('[ሦ]','ሶ',rep12)
    rep14=re.sub('[ዓኣዐ]','አ',rep13)
    rep15=re.sub('[ዑ]','ኡ',rep14)
    rep16=re.sub('[ዒ]','ኢ',rep15)
    rep17=re.sub('[ዔ]','ኤ',rep16)
    rep18=re.sub('[ዕ]','እ',rep17)
    rep19=re.sub('[ዖ]','ኦ',rep18)
    rep20=re.sub('[ጸ]','ፀ',rep19)
    rep21=re.sub('[ጹ]','ፁ',rep20)
    rep22=re.sub('[ጺ]','ፂ',rep21)
    rep23=re.sub('[ጻ]','ፃ',rep22)
    rep24=re.sub('[ጼ]','ፄ',rep23)
    rep25=re.sub('[ጽ]','ፅ',rep24)
    rep26=re.sub('[ጾ]','ፆ',rep25)
    # normalizing words with labialized Amharic characters such as በልቱዋል or በልቱአል to በልቷል
    rep27=re.sub('(ሉ[ዋአ])','ሏ',rep26)
    rep28=re.sub('(ሙ[ዋአ])','ሟ',rep27)
    rep29=re.sub('(ቱ[ዋአ])','ቷ',rep28)
    rep30=re.sub('(ሩ[ዋአ])','ሯ',rep29)
    rep31=re.sub('(ሱ[ዋአ])','ሷ',rep30)
    rep32=re.sub('(ሹ[ዋአ])','ሿ',rep31)
    rep33=re.sub('(ቁ[ዋአ])','ቋ',rep32)
    rep34=re.sub('(ቡ[ዋአ])','ቧ',rep33)
    rep35=re.sub('(ቹ[ዋአ])','ቿ',rep34)
    rep36=re.sub('(ሁ[ዋአ])','ኋ',rep35)
    rep37=re.sub('(ኑ[ዋአ])','ኗ',rep36)
    rep38=re.sub('(ኙ[ዋአ])','ኟ',rep37)
    rep39=re.sub('(ኩ[ዋአ])','ኳ',rep38)
    rep40=re.sub('(ዙ[ዋአ])','ዟ',rep39)
    rep41=re.sub('(ጉ[ዋአ])','ጓ',rep40)
    rep42=re.sub('(ደ[ዋአ])','ዷ',rep41)
    rep43=re.sub('(ጡ[ዋአ])','ጧ',rep42)
    rep44=re.sub('(ጩ[ዋአ])','ጯ',rep43)
    rep45=re.sub('(ጹ[ዋአ])','ጿ',rep44)
    rep46=re.sub('(ፉ[ዋአ])','ፏ',rep45)
    rep47=re.sub('[ቊ]','ቁ',rep46) # ቁ can also be written as ቊ
    rep48=re.sub('[ኵ]','ኩ',rep47) # ኩ can also be written as ኵ
    return rep48
def remove_punc_and_special_chars(text):
    normalized_text = re.sub('[\!\@\#\$\%\^\«\»\&\*\(\)\…\[\]\{\}\;\“\”\›\’\‘\"\'\:\,\.\‹\/\<\>\?\\\\|\`\´\~\-\=\+\፡\።\፤\;\፦\፥\፧\፨\፠\፣]', '',text)
    return normalized_text

# remove all ASCII characters and Arabic and Amharic numerals
def remove_ascii_and_numbers(text_input):
    rm_num_and_ascii=re.sub('[A-Za-z0-9]','',text_input)
    return re.sub('[\'\u1369-\u137C\']+','',rm_num_and_ascii)
The first process in this component forms all possible bi-grams from the tokenized input text. Next, a chi-square score is computed to detect multi-word expressions: bigrams whose chi-square value exceeds an experimentally chosen threshold are kept as multi-words.
from nltk import BigramCollocationFinder
import nltk.collocations
import io
import re
import os

class normalize(object):
    def tokenize(self,corpus):
        print('Tokenization ...')
        all_tokens=[]
        sentences=re.compile('[!?።(\፡\፡)]+').split(corpus)
        for sentence in sentences:
            tokens=sentence.split() # expecting non-sentence identifiers are already removed
            all_tokens.extend(tokens)
        return all_tokens

    def collocation_finder(self,tokens,bigram_dir):
        bigram_measures = nltk.collocations.BigramAssocMeasures()
        # search for bigrams within the corpus
        finder = BigramCollocationFinder.from_words(tokens)
        finder.apply_freq_filter(3)
        frequent_bigrams = finder.nbest(bigram_measures.chi_sq,5) # chi-square computation
        print(frequent_bigrams)
        PhraseWriter = io.open(bigram_dir, "w", encoding="utf8")
        # persist the detected bigrams one per line so normalize_multi_words can reload them
        for first,second in frequent_bigrams:
            PhraseWriter.write(first+' '+second+'\n')
        PhraseWriter.close()

    def normalize_multi_words(self,tokenized_sentence,bigram_dir,corpus):
        # bigram_dir: the file used to store detected multi-words
        bigram=set()
        sent_with_bigrams=[]
        index=0
        if not os.path.exists(bigram_dir):
            self.collocation_finder(self.tokenize(corpus),bigram_dir)
            # call itself again now that the bigram file exists
            return self.normalize_multi_words(tokenized_sentence,bigram_dir,corpus)
        else:
            text=open(bigram_dir,encoding='utf8')
            for line in iter(text):
                line=line.strip()
                if not line: # line is blank
                    continue
                else:
                    bigram.add(line)
            if len(tokenized_sentence)==1:
                sent_with_bigrams=tokenized_sentence
            else:
                while index <=len(tokenized_sentence)-2:
                    mword=tokenized_sentence[index]+' '+tokenized_sentence[index+1]
                    if mword in bigram:
                        sent_with_bigrams.append(tokenized_sentence[index]+''+tokenized_sentence[index+1])
                        index+=1
                    else:
                        sent_with_bigrams.append(tokenized_sentence[index])
                        index+=1
                if index==len(tokenized_sentence)-1:
                    sent_with_bigrams.append(tokenized_sentence[index])
        return sent_with_bigrams
def arabic2geez(arabicNumber):
    ETHIOPIC_ONE= 0x1369
    ETHIOPIC_TEN= 0x1372
    ETHIOPIC_HUNDRED= 0x137B
    ETHIOPIC_TEN_THOUSAND = 0x137C
    arabicNumber=str(arabicNumber)
    n = len(arabicNumber)-1 # length of the Arabic number
    if n%2 == 0:
        arabicNumber = "0" + arabicNumber
        n+=1
    arabicBigrams=[arabicNumber[i:i+2] for i in range(0,n,2)] # splitting into bigrams
    reversedArabic=arabicBigrams[::-1] # reversing list content
    geez=[]
    for index,pair in enumerate(reversedArabic):
        curr_geez=''
        artens=pair[0] # Arabic tens
        arones=pair[1] # Arabic ones
        amtens=''
        amones=''
        if artens!='0':
            amtens=str(chr((int(artens) + (ETHIOPIC_TEN - 1)))) # replacing with Geez tens [፲,፳,፴, ...]
        else:
            if arones=='0': # for 00 cases
                continue
        if arones!='0':
            amones=str(chr((int(arones) + (ETHIOPIC_ONE - 1)))) # replacing with Geez ones [፩,፪,፫, ...]
        if index>0:
            if index%2!= 0: # odd index
                curr_geez=amtens+amones+ str(chr(ETHIOPIC_HUNDRED)) # appending ፻
            else: # even index
                curr_geez=amtens+amones+ str(chr(ETHIOPIC_TEN_THOUSAND)) # appending ፼
        else: # last bigram (right-most part)
            curr_geez=amtens+amones
        geez.append(curr_geez)
    geez=''.join(geez[::-1])
    if geez.startswith('፩፻') or geez.startswith('፩፼'):
        geez=geez[1:]
    if len(arabicNumber)>=7:
        end_zeros=''.join(re.findall('([0]+)$',arabicNumber)[0:])
        i=int(len(end_zeros)/3)
        if len(end_zeros)>=(3*i):
            if i>=3:
                i-=1
            for thoushand in range(i-1):
                print(thoushand)
                geez+='፼'
    return geez

def getExpandedNumber(self,number):
    if '.' not in str(number):
        return arabic2geez(number)
    else:
        num,decimal=str(number).split('.')
        if decimal.startswith('0'):
            decimal=decimal[1:]
            dot=' ነጥብ ዜሮ '
        else:
            dot=' ነጥብ '
        return arabic2geez(num)+dot+self.arabic2geez(decimal)
Code from the Abe2G/Amharic-Simple-Text-Preprocessing-Usin-Python repository README:
class normalize(object):
    expansion_file_dir='' # assume you have a file listing short forms and their expansions as a gazetteer
    short_form_dict={}

    # Constructor
    def __init__(self):
        self.short_form_dict=self.get_short_forms()

    def get_short_forms(self):
        text=open(self.expansion_file_dir,encoding='utf8')
        exp={}
        for line in iter(text):
            line=line.strip()
            if not line: # line is blank
                continue
            else:
                expanded=line.split("-")
                exp[expanded[0].strip()]=expanded[1].replace(" ",'_').strip()
        return exp

    def expand_short_form(self,input_short_word):
        if input_short_word in self.short_form_dict:
            return self.short_form_dict[input_short_word]
        else:
            return input_short_word
    # normalize_char_level_missmatch, remove_punc_and_special_chars and remove_ascii_and_numbers
    # are identical to the standalone versions shown earlier in this appendix.
BASE_DIR = '../'
DATA_DIR = BASE_DIR+'Dataset/'
MODEL_DIR='Models/'
EMBED_DIR=MODEL_DIR+'Embedding/'
PREPROCESSED_DIR=DATA_DIR +'normalized/'
import nltk.collocations
from nltk import BigramCollocationFinder
import io
import re
import os

class normalize(object):
    def tokenize(self,corpus):
        print('Tokenization ...')
        all_tokens=[]
        sentences=re.compile('[!?።(\፡\፡)]+').split(corpus)
        for sentence in sentences:
            tokens=re.compile('[\s+]+').split(sentence)
            all_tokens.extend(tokens)
        return all_tokens

    def get_short_forms(self,_file_dir):
        text=open(_file_dir,encoding='utf8')
        exp={}
        for line in iter(text):
            line=line.strip()
            if not line: # line is blank
                continue
            else:
                expanded=line.split("-")
                exp[expanded[0].strip()]=expanded[1].replace(" ",'_').strip()
        return exp

    def collocation_finder(self,tokens,bigram_dir):
        bigram_measures = nltk.collocations.BigramAssocMeasures()
        # search for bigrams within a corpus
        finder = BigramCollocationFinder.from_words(tokens)
        finder.apply_freq_filter(3)
        frequent_bigrams = finder.nbest(bigram_measures.chi_sq,5) # chi-square computation
        print(frequent_bigrams)
        PhraseWriter = io.open(bigram_dir, "w", encoding="utf8")
        # persist the detected bigrams one per line so normalize_multi_words can reload them
        for first,second in frequent_bigrams:
            PhraseWriter.write(first+' '+second+'\n')
        PhraseWriter.close()

    def normalize_multi_words(self,tokenized_sentence,bigram_dir,corpus):
        bigram=set()
        sent_with_bigrams=[]
        index=0
        if not os.path.exists(bigram_dir):
            self.collocation_finder(self.tokenize(corpus),bigram_dir)
            # call itself again now that the bigram file exists
            return self.normalize_multi_words(tokenized_sentence,bigram_dir,corpus)
        else:
            text=open(bigram_dir,encoding='utf8')
            for line in iter(text):
                line=line.strip()
                if not line: # line is blank
                    continue
                else:
                    bigram.add(line)
            if len(tokenized_sentence)==1:
                sent_with_bigrams=tokenized_sentence
            else:
                while index<=len(tokenized_sentence)-2:
                    mword=tokenized_sentence[index]+' '+tokenized_sentence[index+1]
                    if mword in bigram:
                        sent_with_bigrams.append(tokenized_sentence[index]+''+tokenized_sentence[index+1])
                        index+=1
                    else:
                        sent_with_bigrams.append(tokenized_sentence[index])
                        index+=1
                if index==len(tokenized_sentence)-1:
                    sent_with_bigrams.append(tokenized_sentence[index])
        return sent_with_bigrams

    def expand_short_form(self,input_short_word,_file_dir):
        if not os.path.exists(_file_dir):
            return input_short_word
        else:
            short_form_dict=self.get_short_forms(_file_dir)
            if input_short_word in short_form_dict:
                return short_form_dict[input_short_word]
            else:
                return input_short_word

    def normalize_char_level_missmatch(self,input_token,lang_resource):
        # lang_resource: file listing "source_characters target_character" pairs, one rule per line
        if not os.path.exists(lang_resource):
            return input_token
        else:
            text=open(lang_resource,encoding='utf8')
            rep=input_token
            for line in iter(text):
                line=line.strip()
                if not line: # line is blank
                    continue
                else:
                    chars=line.split()
                    chars_from=chars[0]
                    chars_to=chars[1]
                    rep=re.sub('['+chars_from+']',chars_to,rep)
            return rep

    def remove_punc_and_special_chars(self,sentence_input,lang_resource):
        # lang_resource: file listing the punctuation/special characters to strip, separated by spaces
        if not os.path.exists(lang_resource):
            return sentence_input
        else:
            text=open(lang_resource,encoding='utf8')
            chars=text.read()
            sp_chars=chars.split(' ')
            punct=set(sp_chars)
            normalized_text=sentence_input
            for p in punct:
                normalized_text=normalized_text.replace(p,'') # drop each listed character
            return normalized_text

    def preprocess_text(self,text_input,model_dir,corpus):
        normalzed_text=[]
        CHARS_DIR=model_dir+DirConfig.CHARS_DIR
        MULTI_DIR=model_dir+DirConfig.MULTI_DIR
        ABRV_DIR=model_dir+DirConfig.ABRV_DIR
        PUNCT_DIR=model_dir+DirConfig.PUNCT_DIR
        for sentence in text_input:
            tokens=re.compile('[\s+]+').split(sentence)
            normalized_token=[]
            multi_words=self.normalize_multi_words(tokens,MULTI_DIR,corpus)
            for token in multi_words:
                short_rem=self.expand_short_form(token,ABRV_DIR)
                char_normalized=self.normalize_char_level_missmatch(short_rem,CHARS_DIR)
                punct_rem=self.remove_punc_and_special_chars(char_normalized,PUNCT_DIR)
                normalized_token.append(punct_rem)
            normalzed_text.append(normalized_token)
        return normalzed_text
# arabic2geez and getExpandedNumber are identical to the versions shown earlier in this appendix.