Neural Machine Translation for Low Resource Languages

welocalize August 10, 2022

Neural Machine Translation (NMT) systems have reached state-of-the-art performance in many language pairs, making researchers increasingly interested in whether current systems can compete with human translators.

The datasets used to train modern MT systems generally contain a vast amount of bilingual sentences. However, not all language pairs have datasets of this size available. For a considerable number of languages worldwide, the amount of available data is relatively small or even non-existent. Low resource languages are those that have relatively less data available for training.

Eirini and Jon at the NeTTT 2022 event

At the NeTTT Conference 2022, academics and practitioners in linguistics, translation, machine translation (MT) and natural language processing (NLP) came together to share research, ideas, and concepts. Eirini Zafeiridou and Jon Cambra (pictured above at NeTTT and below), Welocalize AI and Machine Learning experts, delivered their presentation, NMT for Low Resource Languages, sharing knowledge and research on this popular topic.

Eirini and Jon's headshots

In this blog, Eirini and Jon recap some of the key points from their presentation:

To mitigate the limitation of data, recent studies explore various techniques in the field of low resource NMT:

Data Augmentation

Data augmentation for low resource language content is a widely used technique that generates synthetic segment pairs from the data already available. It works by applying a transformation to the original segment pairs and adding the transformed pairs as new data entries.

Data augmentation is well studied in other areas of Machine Learning but less so in NMT, because the transformation must produce fluent, meaningful segments and preserve the alignment between source and target. Early techniques injected noise on both sides by removing or swapping words, which can improve NMT robustness but can also break the NMT system.
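As a minimal sketch of the noise-injection idea, the snippet below randomly drops words and swaps neighbouring words on both sides of a segment pair. The function names (drop_words, swap_adjacent, add_noise) and the default probabilities are our own illustrative assumptions, not taken from a specific paper or toolkit.

```python
import random

def drop_words(tokens, p_drop=0.1, rng=random):
    """Remove each token with probability p_drop, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p_drop]
    return kept if kept else [rng.choice(tokens)]

def swap_adjacent(tokens, p_swap=0.1, rng=random):
    """Swap neighbouring tokens with probability p_swap (local reordering noise)."""
    out = list(tokens)
    i = 0
    while i < len(out) - 1:
        if rng.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return out

def add_noise(src, tgt, seed=0):
    """Return a noisy copy of a (source, target) segment pair.

    Assumption: segments are whitespace-tokenized strings.
    """
    rng = random.Random(seed)
    noisy_src = swap_adjacent(drop_words(src.split(), rng=rng), rng=rng)
    noisy_tgt = swap_adjacent(drop_words(tgt.split(), rng=rng), rng=rng)
    return " ".join(noisy_src), " ".join(noisy_tgt)
```

Because words are dropped independently on each side, nothing here guarantees that source and target still say the same thing, which is exactly the risk described above.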

More recent work explores replacing words as a type of transformation. Some of the techniques studied in recent years are:

  • Replacing common words with rare words. This provides new context for low-frequency words, which are the most difficult for an NMT system to translate
  • Random replacement of words in source and target
  • Using a Language Model (LM):
    • to mask a word and replace it with another word
    • to mask a word and replace it with its probability distribution over the vocabulary
    • to look for segments containing words in a similar context to an unseen word, and apply the replacement to increase the frequency of such terms in the training data
  • Replacement with dictionary terms
  • Syntax-aware replacement, which uses syntactic rules to select less important words to replace
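To make one of these concrete, here is a rough sketch of replacement with dictionary terms. It assumes a small bilingual dictionary mapping one source word to one target word, and it only replaces a pair when both sides occur in the segment, so source and target stay consistent. The function name replace_with_dictionary and the toy dictionary are hypothetical.

```python
import random

def replace_with_dictionary(src, tgt, dictionary, rng=None):
    """Swap one dictionary entry found in the segment pair for another entry.

    Assumptions: `dictionary` maps a source word to a single target word,
    and segments are whitespace-tokenized. Returns None if no entry applies.
    """
    rng = rng or random.Random(0)
    src_tokens, tgt_tokens = src.split(), tgt.split()
    # Entries whose source and target words both appear in this pair.
    present = [(s, t) for s, t in dictionary.items()
               if s in src_tokens and t in tgt_tokens]
    if not present:
        return None
    old_s, old_t = rng.choice(present)
    # Any other dictionary entry is a candidate replacement.
    candidates = [(s, t) for s, t in dictionary.items() if s != old_s]
    if not candidates:
        return None
    new_s, new_t = rng.choice(candidates)
    new_src = [new_s if tok == old_s else tok for tok in src_tokens]
    new_tgt = [new_t if tok == old_t else tok for tok in tgt_tokens]
    return " ".join(new_src), " ".join(new_tgt)

# Toy English-Spanish dictionary, for illustration only.
toy_dictionary = {"cat": "gato", "dog": "perro"}
new_pair = replace_with_dictionary("the cat sleeps", "el gato duerme", toy_dictionary)
```

This naive version ignores agreement and inflection (gender, number, case), so in practice the replaced segments would still need a fluency check.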

Another low resource scenario is when only one side of the parallel data, such as the target content, is available. In that case, a common technique is back-translation, which uses an NMT system to produce the missing side. The resulting parallel data is noisier, because its quality depends on the NMT system used to produce the missing side. It also depends on other factors, such as the type of NMT system being trained and the amount of original and synthetic data.

Assuming that back-translated data yields a better system, we arrive at iterative back-translation, which repeats the back-translation step with each improved system to produce progressively better ones.
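The loop can be sketched roughly as follows. This is an outline under stated assumptions, not a real training pipeline: `train` is a hypothetical callable that takes a list of (input, output) pairs and returns a translate function, and we assume monolingual data is available on both sides so that both directions can improve each other.

```python
def iterative_back_translation(parallel, mono_src, mono_tgt, train, rounds=2):
    """Alternate between back-translating and retraining in both directions.

    parallel : list of (source, target) segment pairs
    mono_src : list of monolingual source segments
    mono_tgt : list of monolingual target segments
    train    : hypothetical callable(pairs) -> translate(segment) function
    """
    reverse_parallel = [(t, s) for s, t in parallel]
    # Initial systems trained on the real parallel data only.
    forward = train(parallel)            # source -> target
    backward = train(reverse_parallel)   # target -> source
    for _ in range(rounds):
        # Synthetic sources for real target segments (back-translation).
        synthetic_for_forward = [(backward(t), t) for t in mono_tgt]
        # Synthetic targets for real source segments, used as inputs
        # when training the target -> source system.
        synthetic_for_backward = [(forward(s), s) for s in mono_src]
        forward = train(parallel + synthetic_for_forward)
        backward = train(reverse_parallel + synthetic_for_backward)
    return forward, backward
```

In each round the real side of every synthetic pair is always on the output side, so the system learns to produce genuine text even when its input is noisy.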

Using Monolingual Data

We can also try to benefit from monolingual data alone in NMT training. On the one hand, we can use word embedding models trained on the available monolingual data to initialize the embedding layer of the NMT system, a layer that is otherwise easily learned during NMT training only in high resource settings.

On the other hand, a Language Model (LM) such as BERT, GPT or ELMo can be incorporated into an NMT system using different methods:

  • Initializing the NMT model encoder, decoder or both with an LM
  • Shallow fusion: the LM is used to score the generated tokens during translation
  • Deep fusion: the LM is concatenated to the decoder in order to apply constraints that force the token distribution returned by the NMT system to resemble the LM distribution
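As an illustration of shallow fusion only, the sketch below picks the next token by adding a weighted LM log-probability to the NMT log-probability. The function name, the lm_weight value and the toy probabilities are assumptions made for the example; real systems apply this inside beam search over the full vocabulary.

```python
import math

def shallow_fusion_step(nmt_probs, lm_probs, lm_weight=0.3, floor=1e-9):
    """Choose the next token from combined NMT and LM scores.

    score(token) = log p_nmt(token) + lm_weight * log p_lm(token)
    `floor` avoids log(0) for tokens one of the models gives no mass to.
    """
    scores = {}
    for token, p_nmt in nmt_probs.items():
        p_lm = lm_probs.get(token, floor)
        scores[token] = (math.log(max(p_nmt, floor))
                         + lm_weight * math.log(max(p_lm, floor)))
    best = max(scores, key=scores.get)
    return best, scores

# Toy next-token distributions (invented numbers).
nmt_probs = {"house": 0.45, "home": 0.40, "hose": 0.15}
lm_probs = {"house": 0.20, "home": 0.70, "hose": 0.10}

without_lm, _ = shallow_fusion_step(nmt_probs, lm_probs, lm_weight=0.0)
with_lm, _ = shallow_fusion_step(nmt_probs, lm_probs, lm_weight=0.5)
```

With these toy numbers the NMT model alone slightly prefers "house", while the fused score prefers "home" because the LM finds it much more likely in context.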

Transfer Learning

Humans have the ability to use prior experience when learning a new skill. This makes it easier to acquire new knowledge and abilities quickly and more efficiently. On the other hand, Machine Learning algorithms typically learn a new task based on random initialization using a default set of data without any previous knowledge.

Transfer learning is a subfield of Machine Learning that takes the knowledge already acquired on one task and transfers it to another, similar task. It is used extensively in NMT where, instead of training multiple individual models, a parent NMT model is trained on high-resource languages and then used to initialize a smaller child model, which is trained on low-resource language pairs using a much more limited dataset.

Transfer learning can be divided into two main training categories. In the first, the warm-start scenario, we know the child dataset before training and can therefore prepare the parent model appropriately for the transfer process. In the second, the cold-start scenario, we do not know the child languages or dataset in advance, so we cannot prepare the parent model accordingly. Instead, we need to make the appropriate modifications before transferring any parameters to the child model (for instance, merging or updating the model’s vocab).
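One possible cold-start modification, sketched below, is to extend the parent vocab with the child tokens it is missing: existing tokens keep their trained embeddings, and new tokens get small random vectors. The function name extend_vocab, the plain-list embedding table and the initialization range are assumptions for illustration, not a description of any particular toolkit.

```python
import random

def extend_vocab(parent_vocab, parent_embeddings, child_tokens, seed=0):
    """Add unseen child tokens to a parent vocab and embedding table.

    parent_vocab      : dict mapping token -> row index
    parent_embeddings : list of embedding vectors (lists of floats)
    child_tokens      : iterable of tokens from the child data
    """
    rng = random.Random(seed)
    dim = len(parent_embeddings[0])
    vocab = dict(parent_vocab)                       # copy, keep parent intact
    embeddings = [list(row) for row in parent_embeddings]
    for token in child_tokens:
        if token not in vocab:
            vocab[token] = len(embeddings)           # next free row index
            # New tokens start from small random values (an assumption).
            embeddings.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return vocab, embeddings

parent_vocab = {"<unk>": 0, "the": 1}
parent_embeddings = [[0.0, 0.0], [0.5, -0.5]]
child_vocab, child_embeddings = extend_vocab(
    parent_vocab, parent_embeddings, ["the", "nyumba", "nyumba"]
)
```

Tokens shared between parent and child, such as "the" here, are where the transferred knowledge is reused directly.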

Many additional methods have been proposed in recent years that can be used effectively in transfer learning settings. Some highlight the importance of having a strong parent model, while others point to the significance of using similar languages, with word overlap, the same alphabet, a shared vocab or cross-lingual embeddings. Others introduce multi-stage transfer learning or use a pivot language as an intermediate step during the translation process. Finally, in many cases transfer learning can be combined with other proposed methods, such as multilingual NMT, to achieve higher performance.

Multilingual NMT

Multilingual NMT systems are those that translate between more than one language pair, and they can also be used effectively in low resource settings.

In terms of data, a multilingual model can translate:

  • from multiple source to one target language
  • from one source to multiple target languages
  • from multiple source to multiple target languages

In terms of the model’s architecture, a multilingual NMT system can have language-dependent, partially shared or fully shared components (such as encoders, decoders, vocabs and attention mechanisms).
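When all components are shared, the model still needs to know which language to translate into. One widely used approach, which we assume here for illustration, is to prepend a target-language tag to each source segment and mix all language pairs into one training set. The tag format "<2xx>" and the function names are hypothetical.

```python
def tag_source(src, tgt_lang):
    """Prepend a target-language tag, e.g. '<2es> Hello' (tag format is an assumption)."""
    return f"<2{tgt_lang}> {src}"

def build_training_data(corpora):
    """Flatten several bilingual corpora into one tagged training set.

    corpora: dict mapping (src_lang, tgt_lang) -> list of (source, target) pairs
    """
    mixed = []
    for (_src_lang, tgt_lang), pairs in corpora.items():
        for src, tgt in pairs:
            mixed.append((tag_source(src, tgt_lang), tgt))
    return mixed

# Toy one-to-many example: one source language, two target languages.
corpora = {
    ("en", "es"): [("Hello", "Hola")],
    ("en", "fr"): [("Hello", "Bonjour")],
}
training_data = build_training_data(corpora)
```

The same source segment now appears twice with different tags, which is how a single shared model can serve several target languages.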

Before developing any such system, we need to consider which languages are going to be used. Many researchers stress the importance of choosing languages with common morphosyntactic patterns, which may even belong to the same language family. Other studies highlight the significance of using languages with similar writing scripts, or of applying transliteration and data transformation techniques to tackle such differences. As in transfer learning, sharing vocabs and using cross-lingual embeddings have also proven to be effective strategies in multilingual NMT settings. Sufficient representation across all languages is also considered critical. In case of data imbalance, using weighted datasets, augmenting data or oversampling/undersampling certain languages may help. Scaling up the model’s capacity also needs to be considered carefully in terms of the required training time, costs and available hardware or data resources.
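For the data imbalance point, one commonly described way to oversample smaller languages is temperature-based sampling, where each language is sampled with probability proportional to its share of the data raised to the power 1/T. The sketch below assumes this scheme; the function name, the temperature value and the corpus sizes are illustrative only.

```python
def sampling_probabilities(sizes, temperature=5.0):
    """Compute per-language sampling probabilities.

    sizes       : dict mapping language (or language pair) -> number of segments
    temperature : 1.0 keeps the original proportions; larger values move
                  the distribution towards uniform, oversampling small languages
    """
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / temperature)
               for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# Invented corpus sizes: a high-resource and a low-resource language.
sizes = {"fr": 900_000, "sw": 100_000}
proportional = sampling_probabilities(sizes, temperature=1.0)
smoothed = sampling_probabilities(sizes, temperature=5.0)
```

With a temperature of 1 the low-resource language is seen in 10% of the samples, matching its share of the data; with a higher temperature its share grows, at the cost of repeating its segments more often.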

Overall, many interesting methods have been proposed for low resource NMT. However, it's not easy to recommend a best-fit solution, as the choice depends significantly on the type, quantity, and quality of the available data. Combining some of the techniques above may well yield better results.

And here's a photo of the Welocalize AI Engineering Team at NeTTT in Rhodes!

For more information on language and translation technology solutions, connect with the Welocalize team here.