Chapter - 1: Existing System

The document introduces a system for translating code-mixed languages. It describes code-mixed languages as those that mix multiple languages together, making translation difficult without knowledge of both languages. The proposed system would have specialized translators for code-mixed languages like Tanglish and Hinglish to help people understand these languages increasingly used on social media. The system design includes modules for data pre-processing, language detection using TextBlob, and translation of sentences. The implementation uses Python, Google Colab, TextBlob and termcolor packages to build the translator.

Uploaded by

Bavithraa
Copyright
© All Rights Reserved

CHAPTER - 1

INTRODUCTION

1.1 Problem Description:

Code-mixed languages are very common on social media, and translating them is difficult because two languages are mixed together to form a sentence or even a single word. Such sentences can be hard to understand: a reader with no knowledge of one of the languages may not grasp what the writer intended to convey.
Any translator or copywriter needs an excellent command of a language in order to translate a text from one language to another. Translating a code-mixed language is even harder, because it requires knowledge of both of the languages involved.
1.2 Existing System:

A translator for code-mixed languages can be built from a neural network. Such a translator must translate from one language to another: it must handle ordinary sentences in a single language, render them either in natural language or in code-mixed form, and translate code-mixed sentences into natural language. In short, it must translate all sentences, across all of the supported languages and their code-mixed combinations, into natural language.
1.3 Proposed System:

The proposed system provides a dedicated translator for code-mixed languages such as Tanglish (ta-en), Hinglish (hi-en) and Manglish (ml-en). The translator recovers the native-language form of a code-mixed sentence or word. Because code-mixed languages are nowadays used widely across all kinds of social media, we chose to translate them in order to help people understand this content.

CHAPTER – 2

SYSTEM DESIGN

2.1 System Flow diagram

[Flow: Enter the text → Text pre-processing (tokenization, stemming, stop word removal) → Language detection using the TextBlob package → Text translation]

Fig -2.1 System Flow Diagram

2.2 Module Design

• Data Pre-processing
• Tokenization
• Stemming
• Stop Word Removal
• TextBlob language detection
• Translate the sentence

2.2.1 Data Pre-processing

Tokenization
Natural language processing (NLP) analyzes natural language, that is, ordinary human language rather than a computer programming language. Natural languages are analyzed using statistical techniques, allowing extraction of implicit information from textual data.
Tokenization is the breaking down of raw text into smaller components such as words. It splits text into units such as words, phrases, numbers and symbols. These units help developers build intelligent applications that analyze language in order to better understand the meaning of a sentence or document.
For example: How do you do?
Tokenization: How | do | you | do | ?
Tokenization also emits punctuation marks such as ?, ’ and ! in the given text as separate tokens.
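The splitting described above can be sketched in a few lines of Python. This is an illustrative tokenizer written for this report (the function name and regular expression are our own, not TextBlob's internals):

```python
import re

def tokenize(text):
    # match runs of word characters, or any single non-space,
    # non-word character (so punctuation becomes its own token)
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("How do you do?"))  # ['How', 'do', 'you', 'do', '?']
```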

Stemming
Stemming is a linguistic term for the reduction of a word to its stem by stripping suffixes and prefixes, leaving the root of the word. It is used in natural language processing, which helps computers understand and process human communication in natural languages such as English. When we talk about stemming, it is important to remember that there are two types:
i) simple and
ii) morphological.
Simple stemming reduces words to a regular form, such as removing -s, -ed and -ing so that walk, walked and walking all reduce to walk.
Morphological stemming refers to the morphology and inflectional system of a language, where grammatical relationships with stems are recorded rather than reduced (car – cars).
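The simple stemming described above can be sketched directly. This is a toy suffix-stripper written for illustration, not a real stemmer such as NLTK's PorterStemmer; the minimum-stem-length guard is our own addition so short words are left alone:

```python
def simple_stem(word):
    # try the longest suffix first so "walking" loses "ing", not just "s"
    for suffix in ("ing", "ed", "s"):
        # only strip when a reasonably long stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([simple_stem(w) for w in ("walk", "walked", "walking", "walks")])
# ['walk', 'walk', 'walk', 'walk']
```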

Stop Word Removal

Stop word removal is one of the most frequently used preprocessing steps across different NLP applications. The idea is to remove words that appear commonly across all of the documents in a corpus. Articles and pronouns are typically classified as stop words, so articles such as "a," "an" and "the," along with pronouns, are often removed from a dataset for this reason.
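As a minimal sketch of this step (the stop-word list here is a tiny hand-picked sample assumed for illustration; libraries such as NLTK ship much larger lists):

```python
# a tiny illustrative stop-word list of articles and pronouns
STOP_WORDS = {"a", "an", "the", "i", "you", "he", "she", "it", "we", "they"}

def remove_stop_words(tokens):
    # keep only the tokens that are not stop words (case-insensitive)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "sat", "on", "a", "mat"]))
# ['cat', 'sat', 'on', 'mat']
```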

2.2.2 Package Installation

To install the TextBlob package:
!pip install textblob
2.2.3 TextBlob Language Detection

The TextBlob package is first used to detect the language of the given text. It is important to know what language the text contains so that you know what to translate it from and into. TextBlob makes this easy: the detect_language() method of a TextBlob object returns a string naming the language the text is written in. It supports over 150 languages. Note that detect_language() calls an online translation service, so an internet connection is required.
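Because TextBlob's detect_language() delegates to an online service, it cannot be demonstrated offline. As a self-contained illustration of the underlying idea, here is a naive detector that guesses the language whose common words overlap most with the input; the word lists and the guess_language helper are a small assumed sample made up for this sketch, not TextBlob's actual data or API:

```python
# tiny per-language samples of very frequent words (illustrative only)
COMMON_WORDS = {
    "en": {"the", "is", "and", "you", "what", "how"},
    "es": {"el", "la", "es", "y", "que", "como"},
    "fr": {"le", "la", "est", "et", "que", "vous"},
}

def guess_language(text):
    tokens = set(text.lower().split())
    # score each language by how many of its common words appear
    scores = {lang: len(tokens & words) for lang, words in COMMON_WORDS.items()}
    return max(scores, key=scores.get)

print(guess_language("what is the weather like"))  # en
```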

2.2.4 Translate the Sentences

TextBlob is a very simple Python library for processing and manipulating textual data, and it can be used in many different ways.
First install the package from the command line:
!pip install textblob
Then import it:
from textblob import TextBlob
Next, pass a string to the TextBlob constructor; this string is the text that we want to manipulate. The translate() method accepts two arguments, from_lang and to. The from_lang argument is set automatically based on the language TextBlob detects, and the to argument is the language you would like to translate into.

CHAPTER - 3
SYSTEM IMPLEMENTATION

3.1 Software Requirements:

OS : Windows 11

IDE : Google Colab

Language : Python

Front End : Google Colab

Back End : TextBlob Package

3.2 Software Package Details

3.2.1 Python Language

Python is a programming language that can be used to develop all different kinds of software. It was designed with a clear and clean syntax that lets programmers write software in fewer lines of code without sacrificing functionality. Its high-level built-in data structures such as lists and dictionaries, combined with dynamic typing and dynamic binding, make it ideal for rapid application development and for use as a glue language between individual components, and its simple syntax emphasizes readability. Industry leaders also find Python well suited to building test harnesses used during the QA process.

3.2.2 Google Colaboratory

Colab is like Google Docs for coding: it allows you to write, run and share code in the browser. It works for many languages, and has a Python kernel that has been optimized to run on Google's servers, which allows for very fast code execution.
It is also very helpful for machine learning and data analysis, and it is currently available to the public for free. Colab is a virtual environment that lets you share Jupyter notebooks with others without having to download, install or run anything. Your code runs on a private VM hosted on Google Cloud servers; VMs are deleted when idle for a while and have a maximum lifetime enforced by the Colab service.

TextBlob Package
TextBlob makes it really easy to analyze text using Python. All you have to do is install the library, plug it into your code and run TextBlob on some text. You can also use the library for classifying documents, labelling chunks of text, or determining whether a piece of content carries any sentiment, all without doing anything complicated or having to learn how neural networks work.
Termcolor Package
Termcolor colorizes the terminal output of a script or application using ANSI escape codes. To install the termcolor package:
!pip install termcolor
Colors available are:

• grey
• red
• green
• yellow
• blue
• magenta
• cyan
• white

Colour highlights:

• on_grey
• on_red
• on_green
• on_yellow
• on_blue
• on_magenta
• on_cyan
• on_white

Attributes:

• bold
• dark
• underline
• blink
• reverse
• concealed
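Under the hood, termcolor simply wraps text in ANSI escape sequences. This stdlib-only sketch (our own colored helper with a hand-picked subset of standard codes, not the real termcolor API) shows the mechanism:

```python
# standard ANSI foreground color codes (illustrative subset)
ANSI_COLORS = {"red": 31, "green": 32, "yellow": 33,
               "blue": 34, "magenta": 35, "cyan": 36, "white": 37}

def colored(text, color):
    # wrap the text in an ANSI color code and reset styling at the end
    return f"\033[{ANSI_COLORS[color]}m{text}\033[0m"

print(colored("Language translation", "cyan"))
```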
Typing Package

The typing module provides a standard notation for Python function and variable type annotations, which can be used to document code in a concise, standardized format. The same notation also makes it possible to build static and runtime type checkers, static analysers, IDEs and other tools. Since Python 3.5 the typing module is part of the standard library, so no installation is needed there; the pip package installed with ! pip install typing is only a backport for older versions.
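As a small example of such annotations (the translate_all function and its body are hypothetical, made up purely to show the notation):

```python
from typing import List

def translate_all(sentences: List[str], to_lang: str) -> List[str]:
    # hypothetical placeholder: the annotations document that we take and
    # return lists of strings; a real body would call a translator
    return [f"[{to_lang}] {s}" for s in sentences]

print(translate_all(["hello", "world"], "es"))
# ['[es] hello', '[es] world']
```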

3.3 Implementation Details:

The following steps are required to load and run the program:

• Open the web browser
• Open Google Colaboratory
• Sign in with a Google account
• Now you can run the code

A Code-mixed Translation

A major challenge in training a code-mixed-to-English machine translation model is the lack of parallel data. Code-mixing is especially common on social media sites like Twitter, where a word or phrase plucked from another language can go viral as an internet phenomenon and influence other accounts to mix words because they have seen the phrase. Translating code-mixed text is a relatively unexplored task, and one of its biggest challenges remains the scarcity of large parallel training corpora, since it is hard to collect many aligned examples.

3.3 Screen Shots

Fig-3.1: Google Colaboratory

Fig -3.2: Sample code 1

Fig 3.3: Sample code 2

Fig 3.4: Sample code 3

Fig 3.5: Errors occur while implementing

Fig-3.6 Sample code 4

3.4 Source Code

translator.py

Coding for tokenization

Cell 1

my_token = """Come let’s go and stay for a while."""
print(my_token.split())

Coding Translator for Tanglish to English conversion

Cell 1

from termcolor import colored
from typing import TYPE_CHECKING
from textblob import TextBlob

print(colored('\t Language translation', 'cyan', attrs=['bold']))
print('\n')
word = TextBlob(input('Enter the Tanglish text here: '))
z = word.translate(from_lang='ta-en', to='ta')
print(colored('\n \t Translated sentence', 'cyan', attrs=['bold']))
print('\n \t', z)

Cell 2

from termcolor import colored

print(colored('\n \t Translate tanglish to English', 'cyan', attrs=['bold']))
print('\n\t')
print('\t \t', z.translate(from_lang='ta', to='en'))

Coding Translator for Manglish to English conversion

Cell 1

from typing import TYPE_CHECKING
from termcolor import colored
from textblob import TextBlob

print(colored('\t Language translation', attrs=['bold']))
print('\n')
word = TextBlob(input('Enter Manglish text here: '))
z = word.translate(from_lang='ml-en', to='ml')
print(colored('\n \b Translated sentence', attrs=['bold']))
print(z)

Cell 2

from termcolor import colored

print(colored('\n \t Translate manglish to english', 'cyan', attrs=['bold']))
print('\n\t')
print('\t \t', z.translate(from_lang='ml', to='en'))

Translator for various Languages

from textblob import TextBlob

word = TextBlob(input('Enter english text here :'))
z0 = word.translate(from_lang='en', to='es')
z1 = word.translate(from_lang='en', to='fr')
z2 = word.translate(from_lang='en', to='ru')
print('\n', z0)
print('\n', z1)
print('\n', z2)

CHAPTER - 4
RESULT AND ACCURACY

4.1 Accuracy calculation


Much of the latest machine translation work makes use of a metric known as BLEU. It is a score for determining how one candidate translation stacks up against one or more reference translations. Although it was developed to evaluate translation, it can also be used to evaluate text generated for other natural language processing tasks.

To calculate the BLEU score:

import nltk.translate.bleu_score as bleu

reference_translation = ['Here You go.'.split()]
candidate_translation_1 = 'Here You go'.split()

print("BLEU Score: ", bleu.sentence_bleu(reference_translation, candidate_translation_1))
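To make the metric concrete, here is a simplified pure-Python sketch of unigram BLEU (clipped unigram precision times a brevity penalty). Real BLEU as computed by NLTK also averages higher-order n-gram precisions, so this is an illustration, not a replacement:

```python
import math
from collections import Counter

def unigram_bleu(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    ref_counts, cand_counts = Counter(ref), Counter(cand)
    # clipped matches: a candidate word counts at most as often
    # as it appears in the reference
    overlap = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = overlap / len(cand)
    # brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(round(unigram_bleu("Here You go .", "Here You go"), 3))  # 0.717
```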

4.2 Result
Now that we have a better understanding of how the BLEU score works, and a plan of action, it is time to start improving the translator. The current translator is just a prototype and has some rough edges, but it is ready to run and improve. The first task is to clean up the clunkiness in the existing code and turn it into a modular design, which makes it easy to extend and maintain. Once the translator can run on its own, it can be taken to the next level by adding more features and improving the design.

CHAPTER-5

CONCLUSION AND FUTURE ENHANCEMENT


5.1 Conclusion:

This project aims to bring the translation power of Google Translate to any website. It can translate any website containing English into another language of the user's choice. It is written in Python: it takes the input URL, uses NLTK for text processing and Google Translate for the translation, and is a complete project that is ready to use. The project has been tested by translating a few websites, and all of the output has been error free. It is helpful for people who are not good at English, who can now read any website in other languages.
By doing this project we:

• Gained knowledge about Natural Language Processing (NLP).
• Got good experience working with Google Colaboratory.
• Learned the basic concepts of the Python language.
• Gained experience while doing research about Natural Language Processing (NLP).

5.2 Future Enhancement:

In the future, this project could evolve to the next level by using deep learning, with algorithms such as CNN, LSTM and RNN, and by working towards a more efficient algorithm. We are currently trying to collect datasets for future use, and we aim for a BLEU score of 19. Building this translator as a mobile application would let users translate anywhere, and a web page service is also planned.

Reference:

[1] Yuji Sugiyama, Koji Torii, Mamoru Fujii, "A Processing System for Program Specifications in Natural Language".
[2] Bill Z. Manaris, "Natural Language Processing Tools and Environments: The Field in Perspective".
[3] Zhaorong Zong, Changchun Hong, "On Application of Natural Language Processing in Machine Translation".
[4] Utsab Barman, Amitava Das, Joachim Wagner, Jennifer Foster, "Code Mixing: A Challenge for Language Identification in the Language of Social Media".
[5] Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, et al., "Beyond English-Centric Multilingual Machine Translation".
[6] Hila Gonen, Yoav Goldberg, "Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training".
[7] Abhirut Gupta, Aditya Vavre, Sunita Sarawagi, "Training Data Augmentation for Code-Mixed Translation".
[8] Sergey Edunov, Myle Ott, Michael Auli, David Grangier, "Understanding Back-Translation at Scale".
[9] Roee Aharoni, Melvin Johnson, Orhan Firat, "Massively Multilingual Neural Machine Translation".
[10] Vivek Srivastava, Mayank Singh, "PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation".
