Büyük, Osman (2005) Sub-word language modelling for Turkish speech recognition. [Thesis]
PDF
buyukosman.pdf
Download (506kB)
Abstract
In large vocabulary continuous speech recognition (LVCSR) for agglutinative languages, we encounter problems due to the theoretically infinite full-word lexicon size. Sub-word lexicon units may be used to dramatically reduce the out-of-vocabulary rate in test data, and one can develop language models based on sub-word units to perform LVCSR. However, using sub-word lexicon units has not always been beneficial, since shorter units are more acoustically confusable with one another and the language model history is effectively shorter than in full-word language models. To reduce these problems, we propose using the longest possible sub-word units in our lexicon, namely half-words and full-words only. We also incorporate linguistic rules of half-word combination into our statistical language model. The language constraints are represented with a rule-based weighted finite-state machine (WFSM), which can be combined with an N-gram language model to yield a better and smaller language model. We study the performance of the proposed system for Turkish LVCSR when the language constraint takes the form of enforcing vowel harmony between stems and endings. We also introduce novel error-rate metrics that are more appropriate than word-error-rate for agglutinative languages. Using half-words with a bi-gram model improves accuracy over a bi-gram full-word model. In addition, combining a tri-gram half-word language model with the vowel-harmony WFSM significantly improves the accuracy further when re-scoring the bi-gram lattices.
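The vowel-harmony constraint mentioned in the abstract can be illustrated with a small sketch. This is not the thesis's WFSM: it assumes a simplified two-way (front/back) harmony rule, lowercase Turkish input, and hypothetical helper names (`harmony_ok`, `rescore`). A stem and ending pair that violates harmony is rejected outright, which mimics combining a rule-based acceptor with an N-gram score.

```python
# Simplified front/back vowel harmony between a stem and an ending.
# Assumption: input is already lowercase; rounding harmony and exceptions
# (loanwords, invariant suffixes) are ignored in this sketch.
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOWELS = BACK_VOWELS | FRONT_VOWELS


def last_vowel(text):
    """Return the last vowel in text, or None if there is none."""
    for ch in reversed(text):
        if ch in VOWELS:
            return ch
    return None


def first_vowel(text):
    """Return the first vowel in text, or None if there is none."""
    for ch in text:
        if ch in VOWELS:
            return ch
    return None


def harmony_ok(stem, ending):
    """True if the ending's first vowel matches the backness of the stem's last vowel."""
    s = last_vowel(stem)
    e = first_vowel(ending)
    if s is None or e is None:
        return True  # nothing to constrain
    return (s in BACK_VOWELS) == (e in BACK_VOWELS)


def rescore(stem, ending, ngram_logprob):
    """Keep the N-gram log-probability if harmony holds, otherwise reject the pair."""
    return ngram_logprob if harmony_ok(stem, ending) else float("-inf")
```

For example, `harmony_ok("ev", "ler")` holds while `harmony_ok("ev", "lar")` does not, so `rescore("ev", "lar", -2.5)` rejects the second combination regardless of its N-gram score.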
Item Type: | Thesis |
---|---|
Uncontrolled Keywords: | Language modelling |
Subjects: | T Technology > TK Electrical engineering. Electronics. Nuclear engineering |
Divisions: | Faculty of Engineering and Natural Sciences > Academic programs > Telecommunications; Faculty of Engineering and Natural Sciences |
Depositing User: | IC-Cataloging |
Date Deposited: | 15 Apr 2008 15:52 |
Last Modified: | 26 Apr 2022 09:45 |
URI: | https://research.sabanciuniv.edu/id/eprint/8284 |