Exploring different representational units in English-to-Turkish statistical machine translation

Oflazer, Kemal and Durgar El-Kahlout, İlknur (2007) Exploring different representational units in English-to-Turkish statistical machine translation. In: Statistical Machine Translation Workshop at ACL 2007, Prague, Czech Republic

Abstract

We investigate different representational granularities for sub-lexical representation in statistical machine translation work from English to Turkish. We find that (i) representing both Turkish and English at the morpheme-level but with some selective morpheme-grouping on the Turkish side of the training data, (ii) augmenting the training data with “sentences” comprising only the content words of the original training data to bias root word alignment, (iii) reranking the n-best morpheme-sequence outputs of the decoder with a word-based language model, and (iv) using model iteration all provide a non-trivial improvement over a fully word-based baseline. Despite our very limited training data, we improve from 20.22 BLEU points for our simplest model to 25.08 BLEU points for an improvement of 4.86 points or 24% relative.
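
Step (iii) of the abstract, reranking the decoder's n-best morpheme-sequence outputs with a word-based language model, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `+`-prefixed suffix notation, the toy unigram model in `WORD_LOGPROB`, the `UNKNOWN_LOGPROB` penalty, the `lm_weight` parameter, and the example scores. A real system would presumably use an n-gram model trained on Turkish word forms and tune the combination weight.

```python
import math

# Hypothetical toy word-level unigram model (log-probabilities).
# Stands in for a word-based language model trained on Turkish text.
WORD_LOGPROB = {
    "evlerde": math.log(0.4),
    "evde": math.log(0.3),
    "kaldık": math.log(0.2),
}
UNKNOWN_LOGPROB = math.log(1e-6)  # assumed penalty for word forms not in the model


def morphemes_to_words(morpheme_sequence):
    """Attach '+'-prefixed suffix tokens to the preceding root (assumed notation)."""
    words = []
    for token in morpheme_sequence.split():
        if token.startswith("+") and words:
            words[-1] += token[1:]
        else:
            words.append(token)
    return words


def word_lm_score(words):
    """Sum of unigram log-probabilities over the reconstructed words."""
    return sum(WORD_LOGPROB.get(word, UNKNOWN_LOGPROB) for word in words)


def rerank(nbest, lm_weight=1.0):
    """Sort (morpheme_sequence, decoder_log_score) pairs by combined score, best first."""
    def combined(item):
        sequence, decoder_score = item
        return decoder_score + lm_weight * word_lm_score(morphemes_to_words(sequence))
    return sorted(nbest, key=combined, reverse=True)


# Illustrative n-best list with made-up decoder scores.
nbest = [
    ("ev +de +ler kaldık", -1.0),
    ("ev +ler +de kaldık", -1.2),
]
best_sequence, _ = rerank(nbest)[0]
print(" ".join(morphemes_to_words(best_sequence)))
```

In this toy setup the second hypothesis has the lower decoder score, but it joins into a word form that the word-level model knows, so it moves to the top after reranking.
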
Item Type: Papers in Conference Proceedings
Subjects: Q Science > QA Mathematics > QA075 Electronic computers. Computer science
P Language and Literature > P Philology. Linguistics
Q Science > QA Mathematics > QA076 Computer software
Divisions: Faculty of Engineering and Natural Sciences
Depositing User: Kemal Oflazer
Date Deposited: 24 Oct 2007 22:27
Last Modified: 26 Apr 2022 08:42
URI: https://research.sabanciuniv.edu/id/eprint/6397
