"Good evening, everyone. We are Group 9, and I am excited to present our project on 'Static Idiom Integrated Machine Translation,' which focuses on improving the accuracy of translations between English and Vietnamese.

I. Introduction
- Accurately translating between English and Vietnamese presents significant challenges due to substantial linguistic differences. These include differing grammar structures, word orders, and especially idiomatic expressions, which often have no direct translations and require contextual understanding to translate correctly.
- Moreover, deploying robust models on mobile or embedded devices that run only on a CPU poses a further problem because of limited computational resources.
- To address these challenges, we have implemented two primary approaches:
1. Utilizing a T5-en-vi model, pre-trained on a large corpus of bilingual texts, which we enhance specifically for idiomatic cases.
2. Speeding up model execution through quantization techniques.

II. Basic Architectures
Before moving to the details of our approaches, I will give a brief overview of the basic model structures used in our project.

1. Seq2Seq
A seq2seq model is composed of an encoder and a decoder, typically implemented as RNNs.
- The encoder processes the input sequence and captures its essential information, which is stored as the hidden state of the network and, in a model with an attention mechanism, a context vector. The context vector is the weighted sum of the input hidden states and is generated for every time step of the output sequence.
- The decoder takes the context vector and hidden states from the encoder and generates the final output sequence. At each step, it considers the previously generated elements, the context vector, and the input sequence information to predict the next element of the output sequence.
- The attention mechanism enables the model to selectively focus on different parts of the input sequence during decoding.
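The context-vector computation just described can be sketched in a few lines of NumPy. This is only an illustration using simple dot-product scoring; the actual alignment model varies (Bahdanau-style additive attention is another common choice), and the function and variable names here are hypothetical:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

def attention_step(decoder_state, encoder_states):
    """One decoder step of dot-product attention.

    decoder_state:  (d,)   current decoder hidden state
    encoder_states: (T, d) encoder hidden states, one per input position
    Returns the context vector (the attention-weighted sum of the
    encoder hidden states) and the attention weights themselves.
    """
    scores = encoder_states @ decoder_state   # (T,) alignment scores
    weights = softmax(scores)                 # attention distribution over inputs
    context = weights @ encoder_states        # (d,) weighted sum = context vector
    return context, weights

# toy example: 4 input positions, hidden size 3
enc = np.random.randn(4, 3)
dec = np.random.randn(3)
ctx, w = attention_step(dec, enc)
```

The weights form a probability distribution over input positions, which is what lets the decoder "focus" on the most relevant parts of the source sentence at each step.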
At each decoder step, an alignment model computes the attention scores from the current decoder state and all of the encoder hidden vectors.

2. T5 model
T5, or Text-to-Text Transfer Transformer, is a Transformer-based architecture that uses a text-to-text approach.

Encoder
The input tokens are first converted into vectors by the input embedding layer. Positional encoding is added to these embeddings, which then pass through multiple encoder layers, each consisting of a multi-head attention mechanism, followed by a residual connection with layer normalization, and a feed-forward network with another residual connection and normalization.

Decoder
Output tokens are similarly converted into vectors through the output embedding layer, with positional encoding added. These embeddings pass through several decoder layers, each starting with masked multi-head attention, followed by a residual connection with layer normalization, then multi-head attention over the encoder's output with another residual connection and normalization, and finally a feed-forward network with a last residual connection and normalization.

Output Processing
The final output from the decoder is projected through a linear layer into the desired output dimension. A softmax function then converts these projections into probability distributions over the output vocabulary.

This architecture's unique strength is its flexible text-to-text format: it can handle a variety of tasks by converting them all into a text-to-text format.
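The output-processing step (linear projection followed by softmax over the vocabulary) can be sketched in plain NumPy. The dimensions and names below are toy values for illustration, not T5's real configuration:

```python
import numpy as np

def output_layer(decoder_out, W, b):
    """Project a decoder hidden vector to vocabulary probabilities.

    decoder_out: (d,)    final decoder hidden vector for one position
    W:           (V, d)  linear projection weights (V = vocab size)
    b:           (V,)    bias
    """
    logits = W @ decoder_out + b           # linear layer: hidden -> vocab logits
    e = np.exp(logits - logits.max())      # numerically stable softmax
    probs = e / e.sum()
    return probs

rng = np.random.default_rng(0)
d, V = 8, 10                               # toy hidden size and vocab size
probs = output_layer(rng.standard_normal(d),
                     rng.standard_normal((V, d)),
                     rng.standard_normal(V))
next_token = int(np.argmax(probs))         # greedy decoding picks the most probable token
```

In practice the decoder emits one such distribution per output position, and a decoding strategy (greedy, beam search, sampling) selects the next token from it.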
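Our second approach from the introduction, quantization, rests on a simple idea: store weights at lower precision to cut memory and speed up CPU inference. The sketch below shows symmetric per-tensor int8 quantization of a single weight matrix; it is a minimal illustration of the concept only, not the exact technique used in our pipeline, and all names are hypothetical:

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor int8 quantization of a weight matrix."""
    scale = np.abs(W).max() / 127.0                          # one scale for the whole tensor
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 matrix from int8 weights."""
    return q.astype(np.float32) * scale

W = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(W)
W_hat = dequantize(q, scale)
max_err = np.abs(W - W_hat).max()   # rounding error is bounded by scale / 2
```

This is a 4x reduction in weight storage (int8 vs. float32) at the cost of a small, bounded rounding error; production frameworks apply the same idea per layer with calibrated scales.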