Tuesday, April 18, 2017

Install Moses in Ubuntu

In this blog, we will mainly focus on installing Moses and its data processing tools on the Ubuntu operating system. We need to install some other packages before installing Moses; we will cover those in this blog as well.

Before we start, make sure the following packages are installed:

g++
git
subversion
automake
libtool
zlib1g-dev
libboost-all-dev
libbz2-dev
liblzma-dev
python-dev
libtcmalloc-minimal4


If you have not installed the above packages, you can install them using the command below.

sudo apt-get install <package name>
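
For example, you can install all of the packages listed above in one go:

sudo apt-get install g++ git subversion automake libtool zlib1g-dev libboost-all-dev libbz2-dev liblzma-dev python-dev libtcmalloc-minimal4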

g++ and Boost are needed to compile Moses. The command above already installs Boost (libboost-all-dev), but if you want a newer version you can also build Boost yourself. Below we will see how to install Boost from source.

Installing Boost

For that, we need to download Boost. You can use the wget command to download it, or, if you have any trouble, you can download the latest version of Boost directly from https://sourceforge.net/projects/boost/files/boost/. After you download boost<version>.tar.gz, you can extract it using the following command.

tar zxvf boost<version>.tar.gz
Then go inside the Boost folder and run the bootstrap script.

cd boost<version>/ 
./bootstrap.sh 
./b2 -j5 --prefix=$PWD --libdir=$PWD/lib64 --layout=tagged link=static threading=multi,single install || echo FAILURE

This creates the library files in the directory lib64, NOT in the system directory.

Note: In the last command, "-j5" tells b2 to run 5 compilation jobs in parallel (the value I used on my machine). If your machine has a different number of cores, change this value accordingly.
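
If you are not sure how many cores your machine has, you can let the shell fill in the job count automatically; this is the same command as above, using the coreutils nproc utility:

./b2 -j"$(nproc)" --prefix=$PWD --libdir=$PWD/lib64 --layout=tagged link=static threading=multi,single install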

Installing Moses

To install Moses, you need to clone it from GitHub; that is why we installed git on our system.

You can clone Moses from https://github.com/moses-smt/mosesdecoder using the commands below.
git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder/ 
Then you can compile Moses using

make -f contrib/Makefiles/install-dependencies.gmake 
./compile.sh
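
If the Moses build does not pick up the Boost you compiled earlier, you can run bjam yourself and point it at that Boost directory. This is only a sketch based on the standard Moses bjam options; the path below assumes you extracted Boost in your home directory, so adjust it to your own location:

./bjam --with-boost=$HOME/boost<version> -j5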

Installing Word Alignment tool

Moses requires a word alignment tool, such as GIZA++, MGIZA, or fast_align. Here I am going to cover installing GIZA++ and MGIZA. You can select whichever you want to use for word alignment, so you only need to install one of them.

  • Installing GIZA++
You can clone GIZA++ from https://github.com/moses-smt/giza-pp and build it in the folder where you wish to install it:
git clone https://github.com/moses-smt/giza-pp.git
cd giza-pp
make

If you copy the GIZA++ binaries into the mosesdecoder tools directory, it is easier when you train the system afterwards.

cd ~/mosesdecoder
mkdir tools
cp ~/giza-pp/GIZA++-v2/GIZA++ ~/giza-pp/GIZA++-v2/snt2cooc.out ~/giza-pp/mkcls-v2/mkcls tools
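
Later, when you run the Moses training script, you can tell it where these external binaries live. A hedged sketch: only the relevant option is shown here, and the "..." stands for the other training options, which are covered in the next blog.

~/mosesdecoder/scripts/training/train-model.perl ... -external-bin-dir ~/mosesdecoder/tools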

  •  Installing MGIZA
You can clone MGIZA from https://github.com/moses-smt/mgiza.

Clone it into the folder where you wish to install it, then build it with cmake:

git clone https://github.com/moses-smt/mgiza.git
cd mgiza/mgizapp
cmake .
make
make install

It will take some time to build, so you can take a short rest.
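
Once the build finishes, as with GIZA++ you may want to copy the MGIZA binaries and its alignment-merging script into the mosesdecoder tools directory. A sketch, assuming you cloned MGIZA in your home directory:

cp ~/mgiza/mgizapp/bin/* ~/mgiza/mgizapp/scripts/merge_alignment.py ~/mosesdecoder/tools/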

Installing IRSTLM

You can create a language model using IRSTLM. Language model toolkits perform two main tasks: training and querying. You can train a language model with any of them, produce an ARPA file, and query it with a different one. To train a model, just call the relevant script. If you want to use SRILM or IRSTLM to query the language model, then they need to be linked with Moses.

You need to download IRSTLM from http://sourceforge.net/projects/irstlm/

tar zxvf irstlm-<version>.tgz
cd irstlm-<version>
./regenerate-makefiles.sh
./configure --prefix=$HOME/irstlm-<version>
make install
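
To actually link IRSTLM with Moses (so that the decoder can query IRSTLM language models), recompile Moses and point bjam at the IRSTLM installation prefix used above; a sketch:

cd ~/mosesdecoder
./bjam --with-irstlm=$HOME/irstlm-<version> -j5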



Fine, now we have installed Moses and the related tools, and we are ready to build a baseline system. In the next blog, we will see how to build a baseline system for Tamil to Sinhala translation.
 

Sunday, April 16, 2017

Introduction to Moses

Before coming to Moses, we need a brief introduction to Natural Language Processing and language translation. Then we can understand Moses more easily.

Natural Language Processing

Natural Language Processing is an Artificial Intelligence method used to communicate with intelligent systems, such as computers, in natural languages such as English, Tamil, and Sinhala. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation. NLP considers the hierarchical structure of language: several words make a phrase, several phrases make a sentence. NLP is commonly used for text mining, machine translation, and automated question answering.

We can use NLP to translate from one language to another language. Instead of hand-coding large sets of rules, NLP can rely on machine learning to automatically learn these rules by analyzing a set of examples (i.e. a large corpus) and making a statistical inference.

Moses

Moses is a statistical machine translation system which allows you to translate from one language to another by training translation models. For training the models, you need a collection of translated texts in both languages (a parallel corpus). Once you have a trained model, an efficient search algorithm quickly finds the highest probability translation among the exponential number of choices. It is a data-driven machine translation approach. The Moses system is based on Bayes' theorem.

From a translation point of view, the probability of a translation from language f into language e depends on the probability of translating e into f and on the probability of e itself (the language model).
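
In the usual noisy-channel formulation of this Bayes-rule decomposition (the standard way this is written in the SMT literature, not something specific to this blog), the best translation \hat{e} of a foreign sentence f is:

\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e) \, P(e)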

Further, the system can be broken down into a log-linear model over several feature weights (see the formula after the list):


  • Weight-t Translation
  • Weight-l Language model
  • Weight-d distortion (reordering)
  • Weight-w word penalty
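
Written out, the standard log-linear model combines feature functions h_i(e, f) (translation model, language model, distortion, word penalty) with their corresponding weights \lambda_i:

P(e \mid f) \propto \exp\left( \sum_{i} \lambda_i \, h_i(e, f) \right)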

Moses was developed in C++ for efficiency and follows a modular, object-oriented design. The toolkit is a complete out-of-the-box translation system for academic research. It consists of all the components needed to preprocess data and to train the language models and the translation models. It also contains tools for tuning these models using minimum error rate training and for evaluating the resulting translations using the BLEU score.
 

Moses requires two main things:
  • Parallel text - a collection of sentences in two different languages which is sentence-aligned: each sentence in one language is matched with its corresponding translated sentence in the other language.
  • Monolingual target data - monolingual data in the target language, from which a statistical language model is built and used by the decoder to try to ensure the fluency of the output.
There are two main components in Moses:
  • Training pipeline - takes the raw data and turns it into a machine translation model.
  • Decoder - translates the source sentence into the target language.
Decoder Modules 
  • Input: This can be a plain sentence, or it can be annotated with xml-like elements to guide the translation process, or it can be a more complex structure like a lattice or confusion network.
  • Translation model: This can use phrase-phrase rules, or hierarchical (perhaps syntactic) rules.
  • Decoding algorithm: Decoding is a huge search problem; a minimal decoder invocation is sketched after this list.
  • Language model: Moses supports several different language model toolkits (SRILM, KenLM, IRSTLM, RandLM).
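
For reference, once a system has been trained, the decoder is run with a configuration file (moses.ini) produced by the training pipeline; a minimal sketch, assuming the default build location:

~/mosesdecoder/bin/moses -f moses.ini < input.txt > output.txt
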
The overall process of the Moses system is illustrated in the picture given below.



We need to install Moses to train the system and get output, so in the next blog we will see how to install Moses on the Ubuntu operating system.