Sunday, December 2, 2018

Linguistic Divergence of Sinhala and Tamil languages in Machine Translation

As I come from an IT background, most of my posts have been about technical topics. This post, however, covers linguistic aspects more than technical ones. While working on a machine translation system between Sinhala and Tamil, I came across an interesting term, 'language divergence', so I have written this post to discuss the study of language divergence. I will explain it with examples from Sinhala and Tamil, since those are the languages I am familiar with. But why is this study important? Many of you probably have this question in mind now. When we built a statistical machine translation system, it did not give perfect results for every sentence. We analysed the output and found that most of the errors occurred because of the different nature of these languages; this is what is called "language divergence". Formally, language divergence is a common phenomenon in translation between two languages which occurs when sentences of the source language do not translate into structurally similar sentences in the target language. The study of divergence is critical, as differences in linguistic and extra-linguistic features play pivotal roles in translation. This post briefly explains the research we have done on the divergence between Sinhala and Tamil and an algorithm to classify the divergences in a parallel corpus.

Our study builds on the machine translation divergence concept introduced by Bonnie J. Dorr, who in 1994 demonstrated a systematic treatment of divergence issues. Dorr classifies translation divergences into two broad types, syntactic divergence and lexical-semantic divergence, and these main categories are further divided into seven subcategories. Since most research has focused on lexical-semantic divergence among the two categories in Dorr's classification, and since Sinhala and Tamil mostly show structural convergence patterns, we too consider only the subcategories of lexical-semantic divergence.

Accordingly, this research has the twin aims of revisiting the classification of divergence types outlined by Dorr and describing some new divergence patterns specific to Sinhala and Tamil. Since these two languages are low-resource, morphologically rich and highly inflected, these efforts gain added importance.

In this research, we propose a rule-based algorithm to classify divergences. The output of a traditional SMT system is used to identify translation divergences, and, following Dorr's classification, we have formulated rules to handle them. The identified divergences were discussed with three people proficient in both Tamil and Sinhala. Given an input Sinhala sentence and the corresponding Tamil sentence, the proposed technique aims at recognizing the occurrence of divergence in the translation.
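
To make the idea concrete, here is a minimal, hypothetical sketch of what such a rule-based check might look like. It is not the algorithm from our paper; the two rules below only illustrate how a couple of Dorr's lexical-semantic types (categorial and conflational divergence) could be detected over a word-aligned, POS-tagged sentence pair, and all function and parameter names are my own.

    from collections import Counter

    def classify_divergences(src_pos, tgt_pos, alignment):
        """Hypothetical rule-based check over one sentence pair.

        src_pos, tgt_pos : POS tags of the Sinhala and Tamil tokens
        alignment        : list of (src_index, tgt_index) word-alignment links
        """
        found = set()

        # Categorial divergence: an aligned word pair whose parts of speech
        # differ, e.g. a source noun realised as a target verb.
        if any(src_pos[i] != tgt_pos[j] for i, j in alignment):
            found.add("categorial")

        # Conflational divergence: one word on either side aligned to several
        # words on the other side, i.e. meaning packed into fewer words.
        src_fanout = Counter(i for i, _ in alignment)
        tgt_fanout = Counter(j for _, j in alignment)
        if any(c > 1 for c in src_fanout.values()) or any(c > 1 for c in tgt_fanout.values()):
            found.add("conflational")

        return found

A real classifier of this kind would add one rule per divergence subcategory and would be run over every sentence pair in the parallel corpus.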

The methodology used to identify the divergence is shown in Figure I.

Using Dorr's classification, five of the seven types are identified for Sinhala-to-Tamil translation. Some additional divergence types which do not fall under Dorr's classification are also identified for Sinhala-to-Tamil translations.

This research was presented in November at the International Conference on Asian Language Processing (IALP) in Bandung, Indonesia. The authors discussed further exploratory analysis conducted using the proposed technique. Examples for each divergence type are given in the paper.

In conclusion, this research focused only on the lexical-semantic divergence of Sinhala and Tamil. However, syntactic divergence between Sinhala and Tamil should also be analyzed.

The slideset from the IALP presentation is here.

Saturday, December 1, 2018

Improving Phrase-Based Statistical Machine Translation with Preprocessing Techniques

From the previous post, I hope you have gained some knowledge of statistical machine translation and its challenges. This study discusses improvements to phrase-based statistical machine translation models that incorporate linguistic knowledge, namely parts-of-speech information, and preprocessing techniques.

My supervisor and I experimented with these improvement methods for the Sinhala-Tamil language pair. Tamil and Sinhala gain importance since both are acknowledged as official languages of Sri Lanka and both are resource-poor. The pre-processing approach itself, however, is language neutral and can be carried over to any other language pair with different parameters for the techniques.

We conducted a comparative study between the state of the art (SiTa) and the state of the art with the pre-processing techniques. Automatic evaluation of the systems was performed using BLEU as the evaluation metric, and the analysis revealed a high percentage of improvement in BLEU score when the preprocessing techniques were used. We observed that all preprocessing techniques outperform the baseline system. The best performance is reported with PMI-based chunking for Sinhala-to-Tamil translation: we could improve performance by 12% BLEU (3.56) on a small Sinhala-Tamil corpus with the help of the proposed PMI-based preprocessing. Notably, this increase is significantly higher than the increases shown by prior approaches for the same language pair.
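
For reference, here is a minimal sketch of how a BLEU score can be computed with NLTK; the sentences are placeholders rather than data from our experiments, and real evaluations are run over the whole test set.

    from nltk.translate.bleu_score import corpus_bleu

    # One list of references per hypothesis; tokens are whitespace-split words.
    references = [[["எனது", "பெயர்", "கீதா"]]]
    hypotheses = [["எனது", "பெயர்", "கீதா"]]

    # Up to trigrams only, since the placeholder sentences are very short.
    score = corpus_bleu(references, hypotheses, weights=(1/3, 1/3, 1/3))
    print(round(score * 100, 2))   # BLEU is usually reported as a percentage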

In the paper, we argued that improving phrase-based statistical machine translation can be of great value for translation between low-resource languages, where there is not enough data to move on to neural machine translation. Any statistical machine translation (SMT) system needs large parallel corpora for good performance, so the non-availability of corpora limits the success achievable in machine translation to and from low-resource languages.

To overcome the translation challenges in traditional SMT, preprocessing techniques are used. The preprocessing described in this research is related to:

  • Generating phrasal units
    • Finding collocation words using PMI (see the sketch after this list)
    • Chunking the named entities
    • Chunking words on top of POS tagging
  • Parts-of-speech (POS) integration
  • Segmentation
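
As a concrete illustration of the PMI step, the sketch below scores adjacent word pairs by pointwise mutual information and keeps the high-scoring ones as candidate phrasal units. It is only a simplified sketch assuming whitespace-tokenised sentences; the count and score thresholds are illustrative, not the values used in the paper.

    import math
    from collections import Counter

    def collocations_by_pmi(sentences, min_count=5, threshold=3.0):
        """Return adjacent word pairs whose PMI exceeds a threshold."""
        unigrams, bigrams = Counter(), Counter()
        for tokens in sentences:
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))

        n_uni = sum(unigrams.values())
        n_bi = sum(bigrams.values())

        scored = {}
        for (w1, w2), c in bigrams.items():
            if c < min_count:
                continue
            pmi = math.log2((c / n_bi) / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))
            if pmi >= threshold:
                scored[(w1, w2)] = pmi
        return scored

    # Pairs returned here can be joined into a single token (e.g. with "_")
    # so that the SMT system treats them as one phrasal unit during training.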


Preprocessing techniques can be classified into lexical, syntactic and semantic categories; in this study, we mainly focused on lexical and syntactic approaches. Since tokenization, punctuation removal, normalization and stop-word removal are already used in traditional SMT to preprocess sentences, this research mainly focused on utilizing phrasal units during tokenization.
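
For completeness, here is a minimal sketch of those conventional lexical steps, assuming Unicode text and a caller-supplied stop-word list; it is not the exact pipeline used in our experiments.

    import string
    import unicodedata

    def basic_preprocess(sentence, stop_words=frozenset()):
        """Baseline lexical preprocessing: Unicode normalisation, punctuation
        removal, whitespace tokenisation and optional stop-word filtering."""
        sentence = unicodedata.normalize("NFC", sentence)
        # Note: string.punctuation covers ASCII punctuation only; script-specific
        # punctuation would need to be added to the translation table.
        sentence = sentence.translate(str.maketrans("", "", string.punctuation))
        tokens = sentence.split()
        return [tok for tok in tokens if tok not in stop_words]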

We used factored modeling to integrate POS features into the translation. Factored translation models can be defined as an extension of phrase-based models in which every word is substituted by a vector of factors such as the surface form, lemma, part-of-speech information, morphology, etc.
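As an illustration, Moses-style factored input attaches the factors to each surface token with vertical bars; the POS tags below are purely illustrative.

    Surface-only input:        මගේ නම ගීතා වේ
    Factored input (word|POS): මගේ|PRP නම|NN ගීතා|NNP වේ|VB
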
This research was presented in November at the International Conference on Asian Language Processing (IALP) in Bandung, Indonesia and will be published by IEEE.

The slideset from the IALP presentation is here.

Statistical Machine Translation

In this post, I am going to discuss statistical machine translation and the challenges in building an SMT system.

Statistical machine translation

Statistical machine translation (SMT) is one of the corpus-based machine translation approaches. It is based on statistical models built by analyzing a parallel corpus and a monolingual corpus. The original idea of SMT was introduced by Brown et al. and is based on Bayes' theorem. Basically, two probabilistic models are used: a translation model (TM) and a language model (LM). The output is generated by maximizing the conditional probability of the target sentence given the source sentence. SMT is described simply in Figure 1. From the human translations in the parallel corpus, the system learns patterns and assigns a probability to each translation, and the most probable translation is then selected when translating a new sentence.
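To state that idea a little more precisely, the classic noisy-channel formulation of Brown et al. picks, for a source sentence s, the target sentence t that maximizes

    t^* = \arg\max_t P(t \mid s) = \arg\max_t P(s \mid t) \cdot P(t)

where P(s | t) comes from the translation model and P(t) from the language model; the decomposition follows from Bayes' theorem, since P(s) is constant for a given input.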


Figure 2 shows the simplified block diagram of a statistical machine translation system using the translation model and the language model.

Translation Model (TM)

The translation system is responsible for producing words that preserve the original meaning and for ordering those words into fluent sentences in the target language. The role of the translation model is to find P(t|e), the probability of the target sentence t given the input sentence e. The training corpus for the translation model is a sentence-aligned parallel corpus of the languages t and e.
The obvious way to compute P(t|e) would be from counts of the sentences t and e in the parallel corpus, but the problem is data sparsity. The immediate solution is to find (or approximate) the sentence translation probability using the translation probabilities of the words in the sentences. The word translation probabilities, in turn, can be estimated from the parallel corpus. Another issue is that the parallel corpus gives us only sentence alignments; it does not tell us how the words within the sentences are aligned.
A word alignment between sentences tells us exactly how each word in sentence t is translated in e. The problem is estimating the word alignment probabilities from a training corpus that is only sentence-aligned. This problem is solved using the Expectation-Maximization (EM) algorithm.
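To make the EM idea concrete, here is a minimal sketch of IBM Model 1 training, which estimates word translation probabilities from a sentence-aligned corpus by repeatedly computing expected alignment counts (E-step) and re-estimating the probabilities from them (M-step). It is a toy implementation, not the code used in real alignment toolkits.

    from collections import defaultdict

    def train_ibm_model1(parallel_corpus, iterations=10):
        """Estimate the probability of a target word given a source word via EM.

        parallel_corpus: list of (source_tokens, target_tokens) pairs.
        """
        target_vocab = {f for _, tgt in parallel_corpus for f in tgt}
        t = defaultdict(lambda: 1.0 / len(target_vocab))  # uniform initialisation

        for _ in range(iterations):
            count = defaultdict(float)   # expected counts c(target_word, source_word)
            total = defaultdict(float)   # expected counts c(source_word)

            # E-step: distribute each target word's count over possible source words
            for src, tgt in parallel_corpus:
                for f in tgt:
                    z = sum(t[(f, e)] for e in src)
                    for e in src:
                        delta = t[(f, e)] / z
                        count[(f, e)] += delta
                        total[e] += delta

            # M-step: re-estimate the translation probabilities
            for (f, e), c in count.items():
                t[(f, e)] = c / total[e]

        return dict(t)

    # Tiny usage example with the two parallel sentences shown below:
    corpus = [
        ("මගේ නම ගීතා වේ".split(), "எனது பெயர் கீதா".split()),
        ("මගේ පොත".split(), "என்னுடைய புத்தகம்".split()),
    ]
    probs = train_ibm_model1(corpus)
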
Current statistical machine translation is based on the insight that a better way to compute these probabilities is by considering the behaviour of phrases. The idea of phrase-based statistical machine translation is to use phrases, i.e. sequences of words, as well as single words as the fundamental units of translation. The phrase-based translation model aims to reduce the restrictions of word-based translation by translating whole sequences of words, where the lengths may differ. These sequences of words are called blocks or phrases, but they are typically not linguistic phrases; they are phrases found from corpora using statistical methods.
Phrase-based models work well only if the source and the target language have almost the same word order. Differences in word order are handled by calculating distortion probabilities, so reordering is done by the phrase-based model itself.
Building the translation model
For example, consider the following parallel sentences:

මගේ නම ගීතා වේ. எனது பெயர் கீதா.

මගේ පොත. என்னுடைய புத்தகம்

A snippet of the phrase table for these parallel sentences is given in the table below.


Sinhala | Tamil | P(T|E)
මගේ | எனது | 0.66
මගේ | என்னுடைய | 0.22
මගේ පොත | எனது புத்தகம் | 0.72
මගේ නම ගීතා | எனது பெயர் கீதா | 0.22

Language model
In general, the language model is used to ensure the fluency of the translated sentence. It plays a main role in the statistical approach, as it picks the most fluent sentence, with a high value of P(t), among all possible translations. The language model can be defined as the model which estimates and assigns a probability P(t) to a sentence t: a high value is assigned to the most fluent sentences and a low value to the least fluent ones. The language model is estimated from a monolingual corpus of the target language. It estimates the probability of each word from n-gram counts, and it is typically computed as a trigram language model.
For example, consider the following Tamil sentences:
ராம் பந்தை அடித்தான்
ராம் பந்தை வீசினான்
Even though the second translation reads awkwardly, the probability assigned by the translation model to each sentence may be the same, since the translation model is mainly concerned with producing the best output words for each word in the source sentence. But when fluency and accuracy come into the picture, only the first translation is correct. This problem is handled very well by the language model, because the probability it assigns to the first sentence will be greater than the probability it assigns to the others. Table 3.2 shows a snippet of the language model.
Table 3.2 Snippet of the language model

w3 | w1 w2 | Score
அடித்தான் | ராம் பந்தை | -1.855783
வீசினான் | ராம் பந்தை | -0.4191293
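
A minimal sketch of how such trigram scores can be estimated from a tokenised monolingual corpus is given below; it uses plain maximum-likelihood estimates with log10 scores, whereas real language models also apply smoothing and back-off for unseen n-grams.

    import math
    from collections import Counter

    def train_trigram_lm(sentences):
        """Count trigrams and their bigram histories over a monolingual corpus."""
        trigrams, bigrams = Counter(), Counter()
        for tokens in sentences:
            padded = ["<s>", "<s>"] + tokens + ["</s>"]
            trigrams.update(zip(padded, padded[1:], padded[2:]))
            bigrams.update(zip(padded, padded[1:]))
        return trigrams, bigrams

    def trigram_logprob(w1, w2, w3, trigrams, bigrams):
        """log10 P(w3 | w1 w2); unseen histories get -inf (no smoothing here)."""
        if trigrams[(w1, w2, w3)] == 0 or bigrams[(w1, w2)] == 0:
            return float("-inf")
        return math.log10(trigrams[(w1, w2, w3)] / bigrams[(w1, w2)])

    # e.g. trigram_logprob("ராம்", "பந்தை", "அடித்தான்", *train_trigram_lm(corpus))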

The Statistical Machine Translation Decoder


The statistical machine translation decoder performs decoding, the process of discovering a target translated sentence for a source sentence using the translation model and the language model. In general, decoding is a search problem that maximizes the translation and language model probabilities. Statistical machine translation decoders use best-first search based on heuristics; in other words, the decoder is responsible for searching for the best translation in the space of possible translations. Given a translation model and a language model, the decoder constructs the possible translations and looks for the most probable one. Beam search decoders use a heuristic search algorithm that explores a graph by expanding only the most promising nodes from a limited set.

The figure above explains the decoding process of statistical machine translation using a Sinhala-to-Tamil example. The Sinhala input sentence "මගේ ගම යාපනය වේ" is given to the decoder. The decoder looks up the translation probabilities for words/phrases in the phrase table, which was already built during training, and creates a tree of all possible translations according to those probabilities. At each step the probabilities are multiplied, and the highest-probability path is selected as the best translation. In this case the best path has probability 0.62, so it is selected as the best translation.
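
A highly simplified, monotone version of this search is sketched below: hypotheses cover the source sentence left to right, their scores are products of phrase-table probabilities, and only a few best partial hypotheses are kept per source position. A real decoder additionally scores hypotheses with the language model and a reordering model, and the probabilities in the phrase table passed in here would be like the illustrative ones in the figure.

    def monotone_beam_decode(src_tokens, phrase_table, beam_size=3, max_phrase_len=3):
        """Toy monotone phrase-based decoder.

        phrase_table: dict mapping a source phrase (space-joined string)
                      to a list of (translation, probability) options.
        Returns the best (score, output_tokens) hypothesis, or None.
        """
        # hypotheses[k] holds partial hypotheses covering the first k source tokens
        hypotheses = {0: [(1.0, [])]}

        for pos in range(len(src_tokens)):
            for score, output in hypotheses.get(pos, []):
                for length in range(1, max_phrase_len + 1):
                    end = pos + length
                    if end > len(src_tokens):
                        break
                    phrase = " ".join(src_tokens[pos:end])
                    for translation, prob in phrase_table.get(phrase, []):
                        hypotheses.setdefault(end, []).append(
                            (score * prob, output + [translation]))
            # beam pruning: keep only the best few hypotheses per position
            for key in hypotheses:
                hypotheses[key] = sorted(hypotheses[key], key=lambda h: h[0],
                                         reverse=True)[:beam_size]

        finished = hypotheses.get(len(src_tokens), [])
        return max(finished, key=lambda h: h[0]) if finished else None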


Compared to other methods, statistical machine translation suits low-resource languages. The advantages of the statistical approach over other machine translation approaches are as follows:

  • The enhanced usage of resources available for machine translation, such as manually translated parallel and aligned texts of a language pair, books available in both languages and so on. That is, a large amount of machine-readable natural language text is available to which this approach can be applied.
  • In general, statistical machine translation systems are language independent, i.e. they are not designed specifically for one pair of languages.
  • Rule-based machine translation systems are generally expensive, as they employ manual creation of linguistic rules, and these systems cannot be generalized to other languages, whereas statistical systems can be generalized to any pair of languages if a bilingual corpus for that particular language pair is available.
  • Translations produced by statistical systems are more natural compared to those of other systems, as they are trained on real texts from bilingual corpora and the fluency of the sentences is also guided by a monolingual corpus of the target language.
Although, as we saw above, the SMT approach has clear advantages over other machine translation systems, there are still some challenges we need to overcome for better translation.

Common challenges of an SMT system
  • Out-of-vocabulary words: some words in the source sentences are left as "not translated" by the MT system since they are unknown to the translation model. The OOV words can be categorized as named entities and inflected forms of verbs and nouns.
  • Reordering: different languages have different word order (some languages are subject-object-verb while others are subject-verb-object). When translating, extra effort is needed to make sure that the output is fluent.
  • Word order and meaning: even though some languages accept free word order when formulating sentences, the meaning of a sentence may differ depending on the order of the words, so we have to be careful when translating from one language to another.
  • Unknown target words or word combinations for the language model: when a word or sequence of words is unknown to the language model, the system struggles to construct fluent output, as it does not have sufficient statistics for selecting among the word choices.
  • The mismatch between the domain of the training data and the domain of interest: writing style and word usage differ radically from domain to domain. For example, the writing of official letters differs much from that of story writing, and the meaning of words may vary depending on the context or domain: the word "cell" translates to a "small part of the body" in the medical domain but to "telephone" in computing.
  • Multiword expressions such as collocations and idioms: translation of such multi-word expressions goes beyond the level of words, so in most cases they are translated incorrectly.
  • Mismatches in the degree of inflection of the source and target languages: each language has its own level of inflection and its own morphological rules, so most of the time there is no one-to-one mapping between inflections. This creates ambiguity while mapping inflected forms.
  • Low-resource languages: lack of parallel data.
  • Orthographical errors: as these languages have more characters than keys on a keyboard, typing in them is somewhat complex. In practice, non-Unicode fonts are often used in document processing, sometimes with local customization of the font. Although this causes no harm for human readers, this non-standardization makes it hard to produce linguistic resources for computer processing, and in most cases the conversion process introduces orthographical errors into the data.

So we need to focus on improving SMT using pre- and post-processing techniques and linguistic information. My next post will focus on improving an SMT system using preprocessing techniques.