---
> [!meta]+ Metadata
> Author: Elia Torre
> Institute: [Halıcıoğlu Data Science Institute](https://datascience.ucsd.edu/), UC San Diego, La Jolla, CA
> Year: 2022
> Github: [Repository](https://github.com/EliaTorre/NLU-UCSD-NeuralMachineTranslation)
> Paper: [[An Applied Evolutionary Analysis of Neural Machine Translation - Elia Torre.pdf]]
---
> [!Abstract]
> *This repository presents research exploring three major leaps in the context of Neural Machine Translation. To do so, three milestone articles in the field are analyzed, i.e., [[Sequence to Sequence Learning with Neural Networks.pdf]], [[Neural Machine Translation by Jointly Learning to Align and Translate.pdf]] and [[Attention Is All You Need.pdf]]. The architectures described in the articles are implemented from scratch in the notebook associated with this repository, and their performance is evaluated with the BLEU score on the German-to-English translation task based on the Multi30k dataset.*
---
## Introduction
**Machine Translation (MT)** is a branch of computational linguistics that deals with the translation of text or speech from one natural language to another. MT is a complex task involving many steps, including text analysis, language processing, and target-language text generation. In recent years, the field has experienced a major shift: **Statistical Machine Translation**, which relies on count-based models and dominated the field for decades, has been superseded by **Neural Machine Translation (NMT)**, a neural-network-based approach that is trained on large amounts of parallel text and can produce high-quality translations.
The objective of this project is to analyze and evaluate three major leaps in architecture development in the context of NMT. The architectures described in the articles are implemented from scratch in the notebook associated with this repository, and their performance is evaluated with the **perplexity metric** and the **BLEU score**.
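Since both metrics recur throughout the notebook, note that perplexity is simply the exponential of the average per-token cross-entropy loss. A minimal sketch of this relationship, with made-up numbers, is shown below.

```python
# Perplexity is the exponential of the average cross-entropy loss.
# The loss value below is illustrative, not a result from the notebook.
import math

mean_cross_entropy = 3.2                  # average token-level loss on a dataset
perplexity = math.exp(mean_cross_entropy)
print(f"perplexity = {perplexity:.1f}")   # ≈ 24.5
```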
### Key Notes:
1. **Architecture**: The models implemented in the associated notebook may differ from those in the articles in network depth and weight initialization, either because of computational constraints or to adopt more recent implementations (e.g., BERT).
2. **Dataset**: Training and evaluation also differ from the articles: the Multi30k German-to-English dataset is used instead of the WMT '14 English-to-French dataset, for the sake of novelty and computational efficiency.
---
## Dataset
The **Multi30k** dataset is a collection of 31,014 parallel English-German sentence pairs for training and evaluating neural machine translation models. It provides training, validation, and test splits of a human-annotated English-German parallel corpus, and extends the Flickr30K dataset, which was originally developed as an image dataset paired with English descriptions. Key characteristics include:
![[dataset.png|40%]]
---
## Pre-Processing
Pre-processing steps (a code sketch follows the list):
1. **Tokenization**: Load English and German "spacy" modules to tokenize input sequences. Append `<SOS>` and `<EOS>` tokens, converting text to lowercase.
2. **Splitting**: Use PyTorch's "dataset split" to create training (29k), validation (1k), and test (1k) sets. Specify German as the source language and English as the target.
3. **Vocabulary Creation**: Construct vocabularies from training data, including words appearing at least twice, using `<UNK>` for others.
4. **Iterator Creation**: Use "BucketIterator" with a batch size of 128 to map tokenized words to vocabulary indices.
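A minimal sketch of these four steps is shown below, assuming the legacy torchtext API (`torchtext.legacy` in versions 0.9–0.11, `torchtext.data`/`torchtext.datasets` in earlier releases); variable names and the device choice are illustrative rather than the notebook's exact code.

```python
# Hedged sketch of the pre-processing pipeline described above.
import spacy
import torch
from torchtext.legacy.data import Field, BucketIterator
from torchtext.legacy.datasets import Multi30k

spacy_de = spacy.load("de_core_news_sm")   # German tokenizer
spacy_en = spacy.load("en_core_web_sm")    # English tokenizer

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

# 1. Tokenization: append <sos>/<eos> tokens and lowercase the text.
SRC = Field(tokenize=tokenize_de, init_token="<sos>", eos_token="<eos>", lower=True)
TRG = Field(tokenize=tokenize_en, init_token="<sos>", eos_token="<eos>", lower=True)

# 2. Splitting: 29k/1k/1k splits with German as source and English as target.
train_data, valid_data, test_data = Multi30k.splits(exts=(".de", ".en"), fields=(SRC, TRG))

# 3. Vocabulary: keep tokens appearing at least twice, map the rest to <unk>.
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

# 4. Iterators: batches of 128, bucketed by length to reduce padding.
device = "cuda" if torch.cuda.is_available() else "cpu"
train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=128, device=device)
```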
---
## Implementations
### Sequence to Sequence Learning with Neural Networks
- **Encoder**: 2-layer LSTM with 512 cells and 256 word embeddings.
- **Decoder**: 4-layer LSTM with 512 cells and 256 word embeddings.
- **Parameters, Loss, & Optimizer**: ~14M, Cross-Entropy, ADAM.
- **Training Time**: 30s/epoch (NVIDIA Tesla K80) for 15 epochs.
- **BLEU Score**: ~14.5 (varies with random seed).
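A minimal sketch of the LSTM encoder-decoder skeleton behind these numbers is shown below. For simplicity it uses the same depth on both sides, whereas the notebook uses the 2-layer/4-layer configuration listed above; class names and shapes are illustrative.

```python
# Hedged sketch of a seq2seq LSTM encoder-decoder (256-dim embeddings, 512 hidden units).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, n_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, batch_first=True)

    def forward(self, src):
        # src: [batch, src_len] source-vocabulary indices
        _, (hidden, cell) = self.rnn(self.embedding(src))
        return hidden, cell                     # final states summarise the sentence

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, n_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, batch_first=True)
        self.fc_out = nn.Linear(hid_dim, vocab_size)

    def forward(self, trg_token, hidden, cell):
        # trg_token: [batch, 1] previously generated (or teacher-forced) token
        output, (hidden, cell) = self.rnn(self.embedding(trg_token), (hidden, cell))
        return self.fc_out(output.squeeze(1)), hidden, cell
```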
### Neural Machine Translation by Jointly Learning to Align and Translate
- **Encoder**: Bi-directional GRU with 512 units.
- **Attention**: Additive alignment model with tanh activation, producing the context vector as a weighted sum of the encoder's hidden states.
- **Decoder**: GRU with 512 hidden units.
- **Parameters, Loss, & Optimizer**: ~20.5M, Cross-Entropy, ADAM.
- **Training Time**: 57s/epoch (NVIDIA Tesla K80) for 10 epochs.
- **BLEU Score**: ~31.5 (varies with random seed).
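A minimal sketch of the additive (tanh-based) attention layer sitting between the bi-directional encoder and the decoder is shown below; dimension names and the module interface are illustrative, not the notebook's exact code.

```python
# Hedged sketch of additive (Bahdanau-style) attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        # Combine the decoder state with each encoder state through a tanh layer,
        # then project to a scalar "energy" per source position.
        self.attn = nn.Linear(enc_dim * 2 + dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_hidden, enc_outputs):
        # dec_hidden:  [batch, dec_dim]            current decoder state
        # enc_outputs: [batch, src_len, enc_dim*2] bi-directional encoder states
        src_len = enc_outputs.shape[1]
        dec_hidden = dec_hidden.unsqueeze(1).repeat(1, src_len, 1)
        energy = torch.tanh(self.attn(torch.cat((dec_hidden, enc_outputs), dim=2)))
        scores = self.v(energy).squeeze(2)         # [batch, src_len]
        weights = F.softmax(scores, dim=1)         # attention distribution
        # Context vector: weighted sum of the encoder states.
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights
```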
### Attention Is All You Need
- **Encoder**:
  - Multi-Head Attention: Parallel computation of Scaled Dot-Product Attention.
  - Position-wise Feed-Forward: Fully connected network with ReLU activation.
- **Decoder**:
  - Multi-Head Attention: Parallel computation of Scaled Dot-Product Attention.
  - Position-wise Feed-Forward: Fully connected network with ReLU activation.
- **Parameters, Loss, & Optimizer**: ~9M, Cross-Entropy, ADAM.
- **Training Time**: 17s/epoch (NVIDIA Tesla K80) for 10 epochs.
- **BLEU Score**: ~35.7 (varies with random seed).
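The operation both blocks share is Scaled Dot-Product Attention, which Multi-Head Attention computes in parallel over several heads. A minimal sketch is shown below; tensor shapes and the optional mask argument are illustrative.

```python
# Hedged sketch of scaled dot-product attention.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: [batch, heads, seq_len, d_k]
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions (padding, or future tokens in the decoder)
        # receive -inf so softmax assigns them zero weight.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights
```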
---
## Evaluation & Results
The following table reports the **BLEU scores** obtained by the three implementations discussed above:
![[results.png|40%]]
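For reference, the sketch below shows how a corpus-level BLEU score can be computed from tokenized hypotheses and references with torchtext's built-in metric; the example sentences are illustrative only.

```python
# Hedged sketch of a corpus-level BLEU computation (4-gram BLEU in [0, 1]).
from torchtext.data.metrics import bleu_score

candidates = [["a", "dog", "runs", "in", "the", "park"]]
references = [[["a", "dog", "runs", "in", "the", "park", "today"]]]
print(bleu_score(candidates, references))
```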
---
## Acknowledgements
This project draws inspiration and code from the following repositories:
1. [farizrahman4u, "Seq2seq"](https://github.com/farizrahman4u/seq2seq)
2. [astorfi, "sequence-to-sequence-from-scratch"](https://github.com/astorfi/sequence-to-sequence-from-scratch)
3. [bentrevett, "pytorch-seq2seq"](https://github.com/bentrevett/pytorch-seq2seq)
4. [macournoyer, "neuralconvo"](https://github.com/macournoyer/neuralconvo)
5. [thomlake, "pytorch-attention"](https://github.com/thomlake/pytorch-attention)
6. [graykode, "nlp-tutorial"](https://github.com/graykode/nlp-tutorial)
7. [Nick-Zhao-Engr, "Machine-Translation"](https://github.com/Nick-Zhao-Engr/Machine-Translation)
8. [jadore801120, "attention-is-all-you-need-pytorch"](https://github.com/jadore801120/attention-is-all-you-need-pytorch)
9. [sooftware, "attentions"](https://github.com/sooftware/attentions)