Syntax-to-morphology alignment and constituent reordering in factored phrase-based statistical machine translation from English to Turkish

Yeniterzi, Reyyan (2009) Syntax-to-morphology alignment and constituent reordering in factored phrase-based statistical machine translation from English to Turkish. [Thesis]


Abstract

English is a moderately analytic language in which meaning is conveyed through function words and the order of constituents. Turkish, by contrast, is an agglutinative language with free constituent order. These differences, together with the lack of large-scale English-Turkish parallel corpora, make Statistical Machine Translation (SMT) between these languages a challenging problem. SMT between the two languages, especially from English to Turkish, has been studied for several years. The initial findings [El-Kahlout and Oflazer, 2006] strongly support representing both Turkish and English at the morpheme level. Several representations and groupings of the morphological structure have also been tried on the Turkish side. In contrast, this thesis focuses mostly on experiments on the English side. We first introduce a new way to align English syntax with Turkish morphology by associating function words with their related content words. This transformation depends solely on the dependency relations between these words. In addition to this improved alignment, we perform syntactic reordering to obtain a more monotonic word alignment. Here, we again use dependencies to identify the sentence constituents and reorder them so that the word order of the source side is closer to that of the target language. We report our results in BLEU, a measure widely used by the MT community. With the improvements in alignment and ordering, we increased our BLEU score from a baseline of 17.08 to 23.78, an improvement of 6.7 BLEU points, or about 39% relative.
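
The idea of associating function words with their related content words through dependency relations can be illustrated with a minimal sketch. It is not the thesis's implementation. The `Token` structure, the label set `FUNCTION_RELATIONS`, the `_`-joined output format, and the example parse are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed set of dependency labels that mark function words (illustrative only).
FUNCTION_RELATIONS = {"det", "case", "aux"}


@dataclass
class Token:
    form: str      # surface form of the word
    head: int      # index of the head token, -1 for the root
    relation: str  # dependency label linking this token to its head


def attach_function_words(tokens):
    """Fold each function word into its head word as a '_'-joined tag."""
    attached = {i: [] for i in range(len(tokens))}
    for tok in tokens:
        if tok.relation in FUNCTION_RELATIONS and tok.head >= 0:
            attached[tok.head].append(tok.form)

    merged = []
    for i, tok in enumerate(tokens):
        if tok.relation in FUNCTION_RELATIONS and tok.head >= 0:
            continue  # the function word no longer stands as its own token
        merged.append("_".join([tok.form] + attached[i]))
    return merged


# Hypothetical parse of "I am walking to the school".
sentence = [
    Token("I", 2, "nsubj"),
    Token("am", 2, "aux"),
    Token("walking", -1, "root"),
    Token("to", 5, "case"),
    Token("the", 5, "det"),
    Token("school", 2, "obl"),
]

print(attach_function_words(sentence))
```

In this sketch the six source tokens collapse into three units, each of which could plausibly align with a single morphologically complex Turkish word. The sketch assumes every function word attaches directly to a content word. A function word whose head is itself a function word would be dropped, which a fuller treatment would need to handle.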
Item Type: Thesis
Uncontrolled Keywords: Statistical machine translation. -- Factored translation model. -- Syntactic alignment and reordering. -- İstatistiksel bilgisayarlı çeviri. -- Faktörlü çeviri modeli. -- Sentaks ile eşleştirme ve yeniden sıralama.
Subjects: T Technology > TK Electrical engineering. Electronics Nuclear engineering > TK7800-8360 Electronics > TK7885-7895 Computer engineering. Computer hardware
Divisions: Faculty of Engineering and Natural Sciences > Academic programs > Computer Science & Eng.
Faculty of Engineering and Natural Sciences
Depositing User: IC-Cataloging
Date Deposited: 01 Jul 2011 15:01
Last Modified: 26 Apr 2022 09:54
URI: https://research.sabanciuniv.edu/id/eprint/16588
