Tuesday, June 21, 2022

Data privacy? What does it mean to me and you?

Finally, I have decided to restart my blogging habit after four years of procrastination. This post is mainly about my thoughts on privacy. What does it mean to me? The first phrase that comes to mind when I think about privacy is 'my research topic' 😉. But is that all? The privacy I am talking about here is digital data privacy, not real-world privacy (the state of being alone, or the right to keep one's personal matters and relationships secret), although perhaps that idea carries over to the digital world too.

When I started to read materials for my research, I gained different perspectives on privacy. I began to behave like the 16-year-old me (in my first year of high school) who applied physics to every real-world scenario (for example, why is it easier to push a door at its edge than near the hinge?). I was a curious kid who had a lot of questions about everything and wanted answers to every one of them. I still remember a question I asked in grade 4: why does carbohydrate digestion begin in the mouth while the digestion of other nutrients starts in the stomach? I still do not have a convincing answer to that nagging question.

Likewise, when I learned about data privacy, a privacy shield encapsulated me and started to restrict me from sharing data or adopting any IoT devices. I keep asking myself: what will happen if I post this photo on Facebook? What will happen if I search for something on the internet? How did they figure out that I am interested in buying winter clothes (through advertisements) when I updated my location on Facebook?

I cannot avoid all these nightmares, though, as three quarters of my day is spent with laptops, mobile phones and smartwatches, generating digital data. So here I have written down some of my nagging questions, with non-technical answers, as a friend who is concerned about privacy.

  • Does that mean a service provider like Google or Facebook (Meta) knows me better than I know myself?

Obviously, service providers collect a lot of data about us. We can describe them as an invisible person watching all the activities we do, consciously or unconsciously. The algorithms embedded in the applications try to learn our behaviour and propose something personalised to us, as if they had minds of their own. However, not all of those recommendations are accurate, and a user may actively dislike a recommendation. For example, after buying a heater, if you see your wall full of heater advertisements, you may feel annoyed. Another example: an application can store your credit card details on its end (or attackers can steal them from the service provider) and buy items automatically without your knowledge until they arrive at your front door (it is worse if they go to someone else's front door 😝). In this way, applications may fall into an uncanny valley: they are intelligent enough to learn the user's behaviour and trigger certain events or social behaviours in users, but still machine-like enough to create dissonance by doing something I do not want or prefer.
  • What is data privacy?
A common definition of data privacy, following the GDPR (the privacy law for European countries), is "the ability of an individual to control when, with whom, and for what purpose to share their data". Privacy for me means that I should have the privilege of sharing my data with the people I want to share it with, and the process should be transparent to me (what happens after I post something on Facebook). I do not want to be hidden behind a black box, without knowing how my data is analysed and processed.

Most of the time we confuse the terms privacy and security. Service-providing companies mostly focus on security rather than privacy. Security means protecting users' data from being hacked or stolen. For example, password authentication is one security mechanism that prevents attackers from using your data. Privacy, however, should be decided by the user, by explicitly specifying what data should be shared and with whom (even with the service provider), together with a transparent data process and movement.
  • What are the purposes of collecting my data?
Service providers collect data for multiple purposes, such as advertising, selling to other applications, health and safety, entertainment, personalisation, commerce, research, and convenience. Some of these purposes benefit both the users (the actual data owners) and the service providers, while others benefit only the service providers. Generally, service providers analyse the data on their end (without selling it to other applications) to improve personalised views, recommendations and convenience for users, for research (the data can also be transferred to other parties such as governments or universities for their research), to improve product quality, or to build and recommend new products. Some data are collected for health and safety purposes (for example, Google Maps location data are sometimes used to alert somebody when a person is in danger, or medical data in a hospital are shared and analysed by medical practitioners), for entertainment (such as recommending your favourite shows on YouTube), and for commerce (showing advertisements and handling financial transactions).

However, even though we are the actual data owners, we do not have the privilege of knowing the purposes for which our data are collected, or of controlling them. We can only view those purposes in the 'accept terms and conditions' document, if we read it carefully (which we usually do not; we just accept everything).
  • What can a user do if they want to protect their privacy in the current environment? 
In practice, from the user's viewpoint, configuring privacy preferences (or privacy settings) is the primary means of managing privacy. To date, many applications require users to explicitly set and regulate data-sharing preferences with other services; more data and more applications mean more privacy responsibility for the users. Unfortunately, the complexity of such settings, coupled with a lack of technical knowledge, results in many users simply choosing the default settings.

However, many studies have statistically shown that the current manual privacy settings are inadequate and that users are not satisfied with them.

So, what can be the solution to this problem? This is the starting point of my research: how can I assist end users in ensuring their privacy?

Further reading:

  • Y. Shanmugarasa, H.-Y. Paik, S. S. Kanhere and L. Zhu, "Towards Automated Data Sharing in Personal Data Stores," 2021 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), 2021, pp. 328-331, doi: 10.1109/PerComWorkshops51409.2021.9431001.
  • Y. Shanmugarasa, H.-Y. Paik, S. S. Kanhere and L. Zhu, "Automated Privacy Preferences for Smart Home Data Sharing Using Personal Data Stores," in IEEE Security & Privacy, vol. 20, no. 1, pp. 12-22, Jan.-Feb. 2022, doi: 10.1109/MSEC.2021.3106056.

Sunday, December 2, 2018

Linguistic Divergence of Sinhala and Tamil languages in Machine Translation

Since I am from an IT background, most of my posts have been about technical topics. This post, however, covers linguistic aspects more than technical ones. While working on a machine translation system between the Sinhala and Tamil languages, I came across an interesting term: 'language divergence'. So I have written this post to discuss the study of language divergence. I will explain the divergence study with examples from Sinhala and Tamil, as those are the languages I am familiar with. But why is this study important? Most people will have this question in mind by now. When we built a statistical machine translation system, it did not give perfect results for every sentence. We analysed the output and found that most of the errors occurred due to the different nature of these languages. This is what is called "language divergence". Formally, language divergence is a common phenomenon in translation between two languages, which occurs when sentences of the source language do not translate into structurally similar sentences in the target language. The study of divergence is critical, as differences in linguistic and extra-linguistic features between languages play a pivotal role in translation. This post briefly explains the research we did on divergence between the Sinhala and Tamil languages and an algorithm to classify the divergences in parallel corpora.

Our team's study builds on the machine translation divergence concept introduced by Bonnie J. Dorr. In 1994, Dorr demonstrated a systematic solution to divergence issues. Dorr classifies translation divergences into two broad types: syntactic divergence and lexical-semantic divergence. These main categories are further divided into seven subcategories. Since most research has focused on lexical-semantic divergence out of the two categories in Dorr's classification, and Sinhala and Tamil mostly show structurally convergent patterns, we too consider only the subcategories of lexical-semantic divergence.

Accordingly, this research has the twin aims of revisiting the classification of divergence types outlined by Dorr and describing some new divergence patterns specific to the Sinhala and Tamil languages. Since these two languages are considered low-resource, morphologically rich and highly inflected, such efforts gain additional importance.

In this research, we propose a rule-based algorithm to classify divergences. The output of a traditional SMT system is used to identify the translation divergences. Based on Dorr's classification, we came up with rules to handle those divergences, and the resulting classifications were discussed with three people linguistically proficient in both Tamil and Sinhala. Given an input Sinhala sentence and the corresponding Tamil sentence, the proposed technique aims at recognising the occurrence of divergence in the translation; a simplified sketch of what such a rule can look like is given below.
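To give a flavour of rule-based divergence classification, here is a minimal Python sketch. It is not the actual rule set from the paper: the function name, inputs and the two illustrative rules (categorial and conflational checks over a word-aligned, POS-tagged sentence pair) are my own simplifying assumptions.

```python
from collections import Counter

def classify_divergence(alignment, src_pos, tgt_pos):
    """Toy rule-based checks for two of Dorr's lexical-semantic divergence types.

    alignment: list of (src_index, tgt_index) word-alignment links
    src_pos:   POS tags of the Sinhala tokens
    tgt_pos:   POS tags of the Tamil tokens
    """
    labels = set()

    # Categorial divergence (illustrative rule): an aligned word pair whose
    # POS categories differ, e.g. a source adjective realised as a target verb.
    for i, j in alignment:
        if src_pos[i] != tgt_pos[j]:
            labels.add("categorial")

    # Conflational divergence (illustrative rule): one source word aligned to
    # several target words, i.e. a single concept spread over a multi-word unit.
    fanout = Counter(i for i, _ in alignment)
    if any(c > 1 for c in fanout.values()):
        labels.add("conflational")

    return labels or {"none detected"}
```

A real classifier would need rules for the remaining subcategories and for the Sinhala-Tamil-specific patterns described in the paper.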

The methodology used to identify the divergences is shown in Figure I.

Using Dorr's classification, five of the seven types were identified for Sinhala-to-Tamil translation. Some additional divergence types that do not fall under Dorr's classification were also identified for Sinhala-to-Tamil translations.

This research was presented in November at the International Conference on Asian Language Processing (IALP) in Bandung, Indonesia, where we discussed further exploratory analysis conducted using the proposed technique. Examples for each divergence type are given in the paper.

In conclusion, this research focused only on the lexical-semantic divergence of Sinhala and Tamil. However, the syntactic divergence between Sinhala and Tamil should also be analysed.

The slideset from the IALP presentation is here.

Saturday, December 1, 2018

Improving Phrase-Based Statistical Machine Translation with Preprocessing Techniques

From the previous post, I hope you have an idea of statistical machine translation and its challenges. This post discusses improvements to phrase-based statistical machine translation models that incorporate linguistic knowledge, namely part-of-speech information and preprocessing techniques.

My supervisor and I experimented with these improvement methods for the Sinhala-Tamil language pair. Tamil and Sinhala are important here because both are acknowledged as official languages of Sri Lanka, and both are resource-poor. The preprocessing approach itself, however, is language neutral and can be carried over to any other language pair with different parameters for the techniques.

We conducted a comparative study between the state-of-the-art (SiTa) system and the state of the art with our preprocessing techniques, evaluating the systems automatically with the BLEU metric. The analysis revealed a large improvement in BLEU score from the preprocessing techniques: all of them outperform the baseline system. The best performance is reported with PMI-based chunking for Sinhala-to-Tamil translation, where we improved performance by 12% in BLEU (3.56 points) using a small Sinhala-Tamil corpus. Notably, this increase is significantly higher than the increases shown by prior approaches for the same language pair.

In the paper, we argue that improving phrase-based statistical machine translation is of great value for translation between low-resource languages, where we do not have enough data to step into neural machine translation. Any statistical machine translation (SMT) system needs large parallel corpora to perform well, so the unavailability of corpora limits the success achievable in machine translation to and from low-resource languages.

To overcome the translation challenges in traditional SMT, preprocessing techniques are used. The preprocessing described in the research relates to the following (a small sketch of the PMI-based collocation step follows the list):

  • Generating phrasal units
    • Finding collocation words from PMI
    • Chunking the named entities 
    • Chunking words on top of POS tagging
  • Parts of Speech (POS) integration 
  • Segmentation
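As a rough illustration of the PMI-based collocation step (my own simplified sketch; the thresholds, counts and the underscore-joining convention are assumptions, not the exact procedure from the paper), the idea is to score adjacent word pairs by pointwise mutual information and glue high-scoring pairs into a single phrasal token before training:

```python
import math
from collections import Counter

def pmi_collocations(sentences, threshold=3.0, min_count=5):
    """Return adjacent word pairs whose PMI exceeds the threshold."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    n = sum(unigrams.values())
    collocations = set()
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue
        # n is used for both unigram and bigram normalisation, a common simplification.
        pmi = math.log2((c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        if pmi >= threshold:
            collocations.add((w1, w2))
    return collocations

def chunk(sentence, collocations):
    """Join collocation pairs with '_' so the SMT system treats them as one unit."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in collocations:
            out.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out
```

The same chunking idea applies to named entities and POS-based phrasal units; only the way candidate phrases are detected changes.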


Preprocessing techniques can be classified into lexical, syntactic and semantic categories; in this study, we mainly focused on the lexical and syntactic approaches. While tokenization, punctuation removal, normalization and stop-word removal are already used in traditional SMT to preprocess the sentences, this research mainly focused on utilizing phrasal units during tokenization.

We used factored modelling to integrate POS features into the translation. Factored translation models are an extension of phrase-based models in which every word is replaced by a vector of factors such as the word itself, its lemma, part-of-speech information, morphology, and so on.
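Concretely, preparing a corpus for a factored model means annotating each surface word with its extra factors. Here is a minimal sketch producing the usual 'word|factor' representation used by Moses-style factored models; the example sentence and POS tags below are purely illustrative.

```python
def to_factored(tokens, pos_tags):
    """Render a sentence in the 'word|POS' factored format."""
    return " ".join(f"{w}|{p}" for w, p in zip(tokens, pos_tags))

# Illustrative example (tags are made up for demonstration):
# to_factored(["ராம்", "பந்தை", "அடித்தான்"], ["NNP", "NN", "VB"])
# -> "ராம்|NNP பந்தை|NN அடித்தான்|VB"
```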
This research was presented in November at the International Conference on Asian Language Processing (IALP) in Bandung, Indonesia, and will be published by IEEE.

The slideset from the IALP presentation is here.

Statistical Machine Translation

In this blog post, I am going to discuss statistical machine translation and the challenges an SMT system faces.

Statistical machine translation

Statistical Machine Translation (SMT) is one of the corpus-based machine translation approaches. It is based on statistical models built by analysing a parallel corpus and a monolingual corpus. The original idea of SMT was initiated by Brown et al., based on Bayes' theorem. Basically, two probabilistic models are used: a translation model (TM) and a language model (LM). The output is generated by maximising the conditional probability of the target sentence given the source sentence. SMT is described simply in Figure 1. From the translations given by humans in the parallel corpus, the system learns patterns and assigns a probability to each translation; the most probable translation is then selected when translating new input.
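In equation form (one standard way to write this; in the classic noisy-channel formulation, Bayes' theorem turns the translation model around, while the language model P(t) supplies the fluency term discussed below, and P(e) drops out because it does not depend on t):

```latex
\hat{t} \;=\; \arg\max_{t} P(t \mid e)
        \;=\; \arg\max_{t} \frac{P(e \mid t)\, P(t)}{P(e)}
        \;=\; \arg\max_{t} P(e \mid t)\, P(t)
```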


Figure 2 shows a simplified block diagram of a statistical machine translation system using a translation model and a language model.

Translation Model (TM)

The translation system must be capable of choosing words that preserve the original meaning and of ordering those words in a sequence that forms a fluent sentence in the target language. The role of the translation model is to find P(t|e), the probability of the target sentence t given the input sentence e. The training corpus for the translation model is a sentence-aligned parallel corpus of the two languages.
It seems obvious to compute P(t|e) from counts of the sentences t and e in the parallel corpus, but again the problem is data sparsity. The immediately apparent solution is to find (or approximate) the sentence translation probability using the translation probabilities of the words in the sentences. The word translation probabilities, in turn, can be estimated from the parallel corpus. There is another issue, though: the parallel corpus gives us only the sentence alignments; it does not tell us how the words within the sentences are aligned.
A word alignment between sentences tells us exactly how each word in sentence t is translated in e. The problem is obtaining the word alignment probabilities from a training corpus that is only sentence aligned. This problem is solved using the Expectation-Maximization (EM) algorithm.
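For intuition, here is a minimal sketch of IBM Model 1 trained with EM, the simplest of the classic word-alignment models. It ignores the NULL word and word positions, so it is far simpler than what a real alignment toolkit such as GIZA++ does; the data layout is my own illustrative choice.

```python
from collections import defaultdict

def ibm_model1(parallel, iterations=10):
    """parallel: list of (source_tokens, target_tokens) sentence pairs.
    Returns t[(target_word, source_word)], the word translation probabilities."""
    # Uniform initialisation over the target vocabulary.
    tgt_vocab = {w for _, tgt in parallel for w in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(tgt, src)
        total = defaultdict(float)   # expected counts c(src)
        # E-step: collect expected alignment counts.
        for src, tgt in parallel:
            for tw in tgt:
                z = sum(t[(tw, sw)] for sw in src)   # normalisation over source words
                for sw in src:
                    c = t[(tw, sw)] / z
                    count[(tw, sw)] += c
                    total[sw] += c
        # M-step: re-estimate the translation probabilities.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t
```

The learned probabilities yield word alignments, from which phrase pairs are later extracted.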
Current statistical machine translation is based on the insight that a better way to compute these probabilities is to consider the behaviour of phrases. The idea of phrase-based statistical machine translation is to use phrases, i.e. sequences of words, as well as single words, as the fundamental units of translation. In a phrase-based translation model, the aim is to relax the restrictions of word-based translation by translating whole sequences of words, whose lengths may differ. These sequences of words are called blocks or phrases, but they are typically not linguistic phrases; they are phrases found in the corpora using statistical methods.
Phrase-based models work well only if the source and target languages have almost the same word order. Differences in word order are handled in phrase-based models by calculating distortion probabilities; reordering is done by the phrase-based model itself.
Building the Translation Model
E.g.
මගේ නම ගීතා වේ. எனது பெயர் கீதா.

මගේ පොත. என்னுடைய புத்தகம்.

A snippet of the phrase table for the given parallel sentences is shown in the table below.

Sinhala | Tamil | P(T|E)
මගේ | எனது | 0.66
මගේ | என்னுடைய | 0.22
මගේ පොත | எனது புத்தகம் | 0.72
මගේ නම ගීතා | எனது பெயர் கீதா | 0.22
Language model
In general, the language model ensures the fluency of the translated sentence. It plays a main role in the statistical approach, as it picks the most fluent sentence, the one with the highest value of P(t), among all possible translations. The language model can be defined as the model that estimates and assigns a probability P(t) to a sentence t: a high value is assigned to a fluent sentence and a low value to a less fluent one. The language model is estimated from a monolingual corpus of the target language. It obtains the probability of each word from its n-grams; a trigram language model is the standard choice.
For example, consider the following Tamil sentences:
ராம் பந்தை அடித்தான்
ராம் பந்தை வீசினான்
Even though the second translation looks awkward to read, the probability assigned by the translation model to each sentence may be the same, since the translation model is mainly concerned with producing the best output word for each word in the source sentence. But when the fluency and accuracy of the translation come into the picture, only the first translation of the given sentence is correct. This problem is handled very well by the language model, because the probability assigned by the language model to the first sentence will be greater than that of the other sentence. Table 3.2 shows a snippet of the language model.
Table 3.2: Snippet of the language model

w3 | w1 w2 | Score
அடித்தான் | ராம் பந்தை | -1.855783
வீசினான் | ராம் பந்தை | -0.4191293
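A bare-bones version of such a trigram model can be sketched as follows. Real systems use toolkits like SRILM or KenLM with proper smoothing; the add-alpha smoothing and the vocabulary-size constant below are only placeholders I chose for the sketch.

```python
import math
from collections import defaultdict

def train_trigram_lm(corpus, alpha=0.1, vocab_size=10000):
    """corpus: list of token lists from the target-language monolingual corpus.
    Returns a function giving log10 P(w3 | w1, w2)."""
    tri, bi = defaultdict(int), defaultdict(int)
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(toks)):
            bi[(toks[i - 2], toks[i - 1])] += 1
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1

    def logprob(w1, w2, w3):
        # Add-alpha smoothing as a stand-in for proper (e.g. Kneser-Ney) smoothing.
        return math.log10((tri[(w1, w2, w3)] + alpha) /
                          (bi[(w1, w2)] + alpha * vocab_size))

    return logprob
```

The negative scores in Table 3.2 are log probabilities of exactly this kind.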

The Statistical Machine Translation Decoder


The statistical machine translation decoder performs decoding, which is the process of discovering a target translated sentence for a source sentence using the translation model and the language model. In general, decoding is a search problem that maximises the combined translation model and language model probability. Statistical machine translation decoders use best-first search based on heuristics. In other words, the decoder is responsible for the search for the best translation in the space of possible translations: given a translation model and a language model, it constructs the possible translations and looks for the most probable one. Beam search decoders use a heuristic search algorithm that explores a graph by expanding the most promising nodes within a limited set.

The figure above explains the decoding process of statistical machine translation using a Sinhala-to-Tamil example. A Sinhala input sentence, "මගේ ගම යාපනය වේ", is given to the decoder. The decoder looks up the translation probabilities of the words/phrases in the phrase table, which was already built during training. From these probabilities it creates a tree of all possible translations, multiplying the probabilities at each step. The path with the highest probability is selected as the best translation; in this case the best path has probability 0.62 and is chosen as the output.
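To make the search concrete, here is a toy monotone (no reordering) decoder over a phrase table and the trigram logprob function sketched above. It is a drastic simplification of the beam-search decoders used in practice: real decoders combine many feature scores, allow reordering and keep many hypotheses per stack, and the data structures here are my own illustrative assumptions.

```python
import math

def lm_increment(logprob, hyp, new_words):
    """Trigram log-probability of appending new_words to hypothesis hyp."""
    context = ["<s>", "<s>"] + hyp
    total = 0.0
    for w in new_words:
        total += logprob(context[-2], context[-1], w)
        context.append(w)
    return total

def decode_monotone(source_tokens, phrase_table, logprob, max_phrase_len=3):
    """phrase_table: {source phrase: [(target phrase, tm_prob), ...]}.
    Greedy in the sense that only the single best hypothesis is kept per
    number of source words covered."""
    n = len(source_tokens)
    best = {0: ([], 0.0)}   # source words covered -> (target hypothesis, log score)
    for i in range(n):
        if i not in best:
            continue
        hyp, score = best[i]
        for j in range(i + 1, min(n, i + max_phrase_len) + 1):
            src = " ".join(source_tokens[i:j])
            for tgt, tm_prob in phrase_table.get(src, []):
                words = tgt.split()
                new_score = (score + math.log10(tm_prob)
                             + lm_increment(logprob, hyp, words))
                if j not in best or new_score > best[j][1]:
                    best[j] = (hyp + words, new_score)
    return " ".join(best[n][0]) if n in best else None
```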


Compared to other methods, statistical machine translation is well suited to low-resource languages. The advantages of the statistical approach over other machine translation approaches are as follows:

  • The enhanced usage of the resources available for machine translation, such as manually translated parallel and aligned texts of a language pair, books available in both languages, and so on. That is, a large amount of machine-readable natural-language text is available to which this approach can be applied.
  • In general, statistical machine translation systems are language independent, i.e. they are not designed specifically for one pair of languages.
  • Rule-based machine translation systems are generally expensive, as they require the manual creation of linguistic rules, and they cannot be generalised to other languages, whereas statistical systems can be generalised to any pair of languages as long as bilingual corpora for that particular language pair are available.
  • Translations produced by statistical systems are more natural compared to those of other systems, as the system is trained on real texts from bilingual corpora and the fluency of the sentence is guided by a monolingual corpus of the target language.
Although, as we saw above, SMT compares well with other machine translation approaches, there are some challenges we need to overcome for better translation.

Common challenges of SMT systems
  • Out-of-vocabulary words: Some words in the source sentences are left untranslated by the MT system because they are unknown to the translation model. The OOV words are mostly named entities and inflected forms of verbs and nouns.
  • Reordering: Different languages have different word orders (some languages are subject-object-verb while others are subject-verb-object). When translating, extra effort is needed to make sure the output is fluent.
  • Word order within a language: Even though some languages accept free word order when formulating sentences, the meaning of a sentence may change with the order of the words, so we have to be careful when translating from one language to another.
  • Target words or word combinations unknown to the language model: When a word or sequence of words is unknown to the language model, the system struggles to construct fluent output, as it does not have sufficient statistics for selecting among the word choices.
  • Mismatch between the domain of the training data and the domain of interest: Writing style and word usage differ radically from domain to domain. For example, the writing of official letters differs greatly from story writing, and the meaning of words may vary depending on the context or domain: the word "cell" translates to "a small part of the body" in the medical domain but to "telephone" in the computing domain.
  • Multiword expressions such as collocations and idioms: Translating such expressions goes beyond the level of individual words, so in most cases they are translated incorrectly.
  • Mismatches in the degree of inflection of the source and target languages: Each language has its own level of inflection and its own morphological rules, so most of the time there is no one-to-one mapping between inflections. This creates ambiguity while mapping inflected forms.
  • Low-resource languages: Lack of parallel data.
  • Orthographical errors: As these languages have more letters than the keyboard has keys, typing in them is somewhat complex. In practice, non-Unicode fonts are often used in document processing, sometimes with local customisation of the font. Although this does no harm for human reading, this lack of standardisation makes it hard to produce linguistic resources for computer processing, and in most cases the conversion process introduces orthographical errors into the data.

So we need to focus on improving SMT using pre- and post-processing techniques and linguistic information. My next post will focus on improving the SMT system using preprocessing techniques.