Saturday, December 1, 2018

Improving Phrase-Based Statistical Machine Translation with Preprocessing Techniques

From the previous post, you should have a picture of Statistical Machine Translation and its challenges. This study discusses improvements to phrase-based statistical machine translation models that incorporate linguistic knowledge, namely part-of-speech information and preprocessing techniques.

My supervisor and I experimented with these improvement methods on the Sinhala-Tamil language pair. Both languages gain importance because they are acknowledged as official languages of Sri Lanka, and both are resource-poor. The pre-processing approach itself, however, is language neutral and can be carried over to any other language pair with different parameters for the techniques.

We conducted a comparative study between the state-of-the-art (SiTa) system and the same system augmented with the preprocessing techniques. Automatic evaluation was performed using BLEU as the evaluation metric. All preprocessing techniques outperformed the baseline system, and the best performance was reported with PMI-based chunking for Sinhala-to-Tamil translation: a 12% improvement in BLEU (3.56 points) on a small Sinhala-Tamil corpus using the proposed PMI-based preprocessing. Notably, this increase is significantly higher than those reported by prior approaches for the same language pair.

In the paper, the authors posited that improving phrase-based statistical machine translation is of great value for translation between low-resource languages, where there is not enough data to step into neural machine translation. Any statistical machine translation (SMT) system needs a large parallel corpus to perform well, so the non-availability of corpora limits the success achievable in machine translation to and from low-resource languages.

To overcome these translation challenges in traditional SMT, preprocessing techniques are used. The preprocessing described in the research covers:

  • Generating phrasal units
    • Finding collocation words from PMI
    • Chunking the named entities 
    • Chunking words on top of POS tagging
  • Parts of Speech (POS) integration 
  • Segmentation

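As a sketch of the first technique, PMI-based collocation detection scores adjacent word pairs by pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), and pairs above a threshold are merged into a single phrasal token before translation. The function names, toy corpus, and threshold below are invented for illustration; this is not the exact implementation from the paper:

```python
import math
from collections import Counter

def pmi_collocations(sentences, threshold=1.0):
    """Score adjacent word pairs by PMI; keep pairs above the threshold."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    collocations = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        score = math.log2(p_xy / (p_x * p_y))
        if score >= threshold:
            collocations[(w1, w2)] = score
    return collocations

def merge_collocations(sent, collocations, sep="_"):
    """Join collocated adjacent words into one phrasal token."""
    out, i = [], 0
    while i < len(sent):
        if i + 1 < len(sent) and (sent[i], sent[i + 1]) in collocations:
            out.append(sent[i] + sep + sent[i + 1])
            i += 2
        else:
            out.append(sent[i])
            i += 1
    return out
```

Treating the merged pair as one token lets the phrase table learn a single translation for the whole collocation rather than translating its parts independently.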

Preprocessing techniques can be classified into lexical, syntactic and semantic categories. In this study, we mainly focused on lexical and syntactic preprocessing approaches. While tokenization, punctuation removal, normalization and stop-word removal are already used in traditional SMT to preprocess sentences, this research mainly focused on utilizing phrasal units during tokenization.
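To illustrate chunking on top of POS tagging, one simple rule is to merge runs of adjacent tokens that share a noun tag into a single phrasal unit before translation. The tagset and rule below are assumptions for the sketch, not the paper's exact chunking scheme:

```python
def chunk_nouns(tagged, sep="_"):
    """Merge runs of consecutive noun-tagged tokens into one phrasal token.

    `tagged` is a list of (word, pos) pairs; the "NN" tag is an assumed
    noun label from a hypothetical tagset.
    """
    out, i = [], 0
    while i < len(tagged):
        word, pos = tagged[i]
        if pos == "NN":
            j, run = i, []
            while j < len(tagged) and tagged[j][1] == "NN":
                run.append(tagged[j][0])
                j += 1
            out.append((sep.join(run), "NN"))  # one chunked unit
            i = j
        else:
            out.append((word, pos))
            i += 1
    return out
```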

We used factored modeling to integrate POS features into translation. Factored translation models are an extension of phrase-based models in which every word is replaced by a vector of factors such as surface form, lemma, part-of-speech information, morphology, etc.
This research was presented in November at the International Conference on Asian Language Processing (IALP) in Bandung, Indonesia, and will be published by IEEE.

The slideset from the IALP presentation is here.
