Chapter - 1: Existing System
1.1 Introduction:
The usage of code-mixed language is very common on social media, but translating such text is difficult because two languages are mixed together to form a sentence or even a single word. These sentences are harder to understand: a reader who has no knowledge of one of the languages may not grasp what the writer intended to convey.
Any translator or copywriter needs an excellent command of a language in order to translate text from one language to another. Translating a code-mixed language is even harder, because one has to have knowledge of both of the languages involved.
1.2. Existing System:
A translator for code-mixed languages can be built from a neural network. Such a translator must translate text from one language to another: it must handle sentences written in a single language as well as code-mixed sentences, and it must be able to produce its output either in natural (monolingual) language or in code-mixed form. In short, the existing approach expects one general translator to cover all sentences in all languages, monolingual or code-mixed, and render them in natural language.
1.3 Proposed System:
The proposed system is a dedicated translator for code-mixed languages such as Tanglish (ta-en), Hinglish (hi-en) and Manglish (ml-en). This translator recovers the native-language form of a code-mixed sentence or word. Code-mixed language is nowadays used widely across all kinds of social media, which is why we chose to translate it and help such readers.
CHAPTER – 2
SYSTEM DESIGN
[System design flow: Tokenization, Stemming and Stop Word Removal, then Language Detection using the TextBlob package, then Text Translation]
The system design consists of the following stages:
Data Pre-processing: Tokenization, Stemming and Stop Word Removal
Language detection using the TextBlob package
Translation of the sentence
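As an illustration of how these stages could fit together, here is a minimal sketch in Python. It assumes the NLTK and TextBlob libraries; the function and variable names are hypothetical and are not taken from the project source.

# Hypothetical pipeline sketch (names are illustrative, not from translator.py).
# Requires: nltk.download('punkt'); nltk.download('stopwords')
from textblob import TextBlob
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def preprocess(text):
    tokens = word_tokenize(text)                      # tokenization
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]         # stemming
    stops = set(stopwords.words('english'))
    return [t for t in stems if t not in stops]       # stop word removal

def detect_and_translate(text, target='en'):
    blob = TextBlob(text)
    source = blob.detect_language()                   # language detection (older TextBlob versions)
    return str(blob.translate(from_lang=source, to=target))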
Tokenization
Natural language processing (NLP) analyses natural language, that is, ordinary human language produced without any computer programming involved. Natural languages are analysed using statistical techniques, which allows implicit information to be extracted from textual data.
Tokenization is the breaking down of raw text into smaller components such as
words. Tokenization helps break down text into units such as words, phrases, numbers
and symbols. The components of tokenization help developers develop intelligent
applications that analyze language in order to better understand the meaning of a
sentence or document.
For Example :- How do you do?
Tokenization :- How | do | you | do | ?
During tokenization, punctuation marks such as "?", "'" and "!" in the given text are also produced as separate tokens.
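A small illustration, assuming NLTK is installed (the project source itself only shows a plain split()):

from nltk.tokenize import word_tokenize   # requires nltk.download('punkt')

text = "How do you do?"
print(word_tokenize(text))   # ['How', 'do', 'you', 'do', '?']
print(text.split())          # ['How', 'do', 'you', 'do?'] - split() keeps the '?' attached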
Stemming
Stemming is the linguistic term for reducing a word to its stem by stripping suffixes and prefixes, or by mapping it to its root form. It is used in natural language processing to help computers understand and process human communication in natural languages such as English. When we talk about stemming it is important to remember that there are two types:
i) simple and
ii) morphological.
Simple stemming reduces words to a regular form, for example removing -s, -ed or -ing so that walk, walked and walking all map to walk.
Morphological stemming takes the morphology and inflectional system of a language into account: the grammatical relationship between a word and its stem is recorded rather than simply stripped away (car, cars).
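A brief illustration of simple stemming, using NLTK's PorterStemmer (one possible choice of stemmer, not prescribed by the report):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["walk", "walked", "walking", "cars"]:
    print(word, "->", stemmer.stem(word))
# walk -> walk, walked -> walk, walking -> walk, cars -> car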
Stop Word Removal
Stop word removal is one of the most frequently used preprocessing steps across different NLP applications. The idea is to remove words that appear commonly across all of the documents in a corpus. Articles such as "a", "an" and "the", along with pronouns, are typically classified as stop words and are removed from a dataset for this reason.
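For example, using NLTK's English stop word list (an assumption; any stop word list could be used):

from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.tokenize import word_tokenize    # requires nltk.download('punkt')

text = "This is an example of a sentence with the stop words removed"
stops = set(stopwords.words('english'))
print([w for w in word_tokenize(text) if w.lower() not in stops])
# ['example', 'sentence', 'stop', 'words', 'removed']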
The TextBlob package is first used to detect the language of the given text. It is important to know which language the text contains so that you know what to translate it into. TextBlob makes this easy: its detect_language() function returns a string naming the language a given text is written in, and it supports a wide range of languages.
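A minimal sketch of language detection with TextBlob. Note that detect_language() calls the Google Translate web API and has been removed from recent TextBlob releases, so this assumes an older version of the library:

from textblob import TextBlob

blob = TextBlob("Bonjour tout le monde")
print(blob.detect_language())   # 'fr' (older TextBlob versions; needs an internet connection)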
CHAPTER - 3
SYSTEM IMPLEMENTATION
OS : Windows 11
Language : Python
Google Colab
Colab is like Google Docs for coding: it lets you write, run and share code in the browser. It works for many languages and provides a Python kernel optimised to run on Google's servers, which allows very fast code execution.
It is also really helpful for machine learning and data analysis, and it is currently available to the public free of charge. Colab is a virtual environment that lets you share Jupyter notebooks with others without having to download, install or run anything locally. Your code runs on a private VM hosted on Google Cloud servers; VMs are deleted after being idle for a while and have a maximum lifetime enforced by the Colab service.
TextBlob Package
TextBlob makes it really easy to analyse text using Python. All you have to do is install the library, import it in your code and run TextBlob on some text. You can also use this library for classifying documents, labelling chunks of text, or determining whether a piece of content carries any sentiment, all without doing anything complicated or having to learn how neural networks work.
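A short usage sketch (the sample text is illustrative):

from textblob import TextBlob   # run `python -m textblob.download_corpora` once for the taggers

blob = TextBlob("TextBlob makes text analysis really easy!")
print(blob.words)       # tokenized words
print(blob.sentiment)   # Sentiment(polarity=..., subjectivity=...)
print(blob.tags)        # part-of-speech tags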
Termcolor package
Termcolor colourises the terminal output of a script or application using ANSI escape codes. To install the package: !pip install termcolor
Colors available are:
grey
red
green
yellow
blue
magenta
cyan
white
Colour highlights:
on_grey
on_red
on_green
on_yellow
on_blue
on_magenta
on_cyan
on_white
Attributes:
bold
dark
underline
blink
reverse
concealed
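A small example of how these colours and attributes are combined (the printed text is illustrative only):

from termcolor import colored, cprint

# colored() returns the text wrapped in ANSI escape codes
print(colored('Language translation', 'red', 'on_grey', attrs=['bold']))

# cprint() prints directly with the same arguments
cprint('Translation finished', 'green', attrs=['underline'])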
6
Typing Package
The typing module provides a standard notation for Python function and variable type annotations, which can be used to document code in a concise, standardized format. The same notation is also useful for writing down type rules that static and runtime type checkers, static analysers, IDEs and other tools can consume. The module is part of the standard library from Python 3.5 onwards; on older interpreters it can be installed with !pip install typing.
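A minimal sketch of such annotations (the function is hypothetical, not from the project code):

from typing import Dict, List, Optional

def translate_all(sentences: List[str], lang_map: Dict[str, str]) -> List[str]:
    """Annotated signature: a list of sentences in, a list of translations out."""
    return [lang_map.get(s, s) for s in sentences]   # placeholder body for illustration

count: Optional[int] = None   # variable annotation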
The following procedure is required to load and run the program: open the code-mixed translation notebook in Colab and run the cells in order.
3.3 Screen Shots
Fig 3.3: Sample code 2
Fig 3.5: Errors occurred while implementing
3.4 Source Code
translator.py

Cell 1

# Tokenize the input text into words (my_token is the text read earlier in the notebook).
print(my_token.split())

Cell 2

from textblob import TextBlob
from termcolor import colored

# Note: TextBlob's translate() calls the Google Translate web API and has been removed
# from recent TextBlob releases; the language codes below are kept as given in the report.
print(colored('\t Language translation', attrs=['bold']))
print('\n')
word = TextBlob(input('Enter Manglish text here: '))
z = word.translate(from_lang='ml-en', to='ml')
print(colored('\n \b Translated sentence', attrs=['bold']))
print(z)
CHAPTER - 4
RESULT AND ACCURACY
4.2 Result
Now that we have a better understanding of how the BLEU score works, and we have a plan of action, it is time to start improving the translator. The current system is only a prototype and still has some rough edges, but it is ready to run and to be improved. The first task is to take the clunkiness in the existing code and turn it into a modular design, which makes it easy to extend and maintain. Once the translator can run on its own, it can be taken to the next level by adding more features and improving the design.
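As an illustration of how a sentence-level BLEU score can be computed (the report does not show its scoring code; NLTK is assumed here):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]   # ground-truth translation(s)
candidate = ["the", "cat", "sat", "on", "the", "mat"]    # system output
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score * 100, 2))   # BLEU is often reported on a 0-100 scale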
CHAPTER - 5
CONCLUSION
This project is aimed at bringing the translation power of Google Translate to any website. It can translate any website containing English into another language of the user's choice. It is written in Python: it takes an input URL, uses NLTK for text processing and Google Translate for the translation, and performs the translation end to end. It is a complete project that is ready to use. It has been tested by translating a few websites, and all the output has been error free. The project is helpful for people who are not comfortable with English, as they can now read any website in other languages.
Future Enhancement
In the future, this project could be taken to the next level by using deep learning and algorithms such as CNN, LSTM and RNN, and by working towards a more efficient algorithm. We are currently collecting datasets for future use and aim to reach a BLEU score of 19. Building this translator as a mobile application would let users translate anywhere, and a web page service is also planned.
References:
[1] Yugi Sugiyam, Koji Troll, Mamoru Fujii, "A Processing System for Program Specifications in Natural Language"
[2] Bill Z. Mannaric, "Natural Language Processing Tools and Environments: The Field in Perspective"
[3] Zhaorong Zong, Changchun Hong, "On Application of Natural Language Processing in Machine Translation"
[4] Utsab Barman, Amitava Das, Joachim Wagner, Jennifer Foster, "Code Mixing: A Challenge for Language Identification in the Language of Social Media"
[5] Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, et al., "Beyond English-Centric Multilingual Machine Translation"
[6] Hila Gonen, Yoav Goldberg, "Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training"
[7] Abhirut Gupta, Aditya Vavre, Sunita Sarawagi, "Training Data Augmentation for Code-Mixed Translation"
[8] Sergey Edunov, Myle Ott, Michael Auli, David Grangier, "Understanding Back-Translation at Scale"
[9] Roee Aharoni, Melvin Johnson, Orhan Firat, "Massively Multilingual Neural Machine Translation"
[10] Vivek Srivastava, Mayank Singh, "PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation"