Bilgin Taşdemir, Esma Fatıma and Yanıkoğlu, Berrin (2018) Large vocabulary recognition for online Turkish handwriting with sublexical units. Turkish Journal of Electrical Engineering and Computer Sciences, 26 (5). pp. 2218-2233. ISSN 1300-0632 (Print) 1303-6203 (Online)
PDF
tjeecs.pdf
Download (374kB)
Official URL: http://dx.doi.org/10.3906/elk-1801-234
Abstract
We present a system for large-vocabulary recognition of online Turkish handwriting using hidden Markov models. While the recognizer follows a traditional approach, we identify and address the main problems specific to Turkish handwriting recognition. First, since large amounts of Turkish handwriting samples are not available, the system is trained and optimized on the large UNIPEN dataset of English handwriting and then extended to Turkish using a small Turkish dataset. Second, delayed strokes, which are a significant source of variation in writing order due to the large number of diacritical marks in Turkish, are removed during preprocessing. Finally, as a solution to the high out-of-vocabulary rates encountered when using a fixed-size lexicon in general-purpose recognition, a lexicon is constructed from sublexical units (stems and endings) learned from a large Turkish corpus. A statistical bigram language model learned from the same corpus is also applied during decoding.
The system obtains a 91.7% word recognition rate when tested on a small Turkish handwritten word dataset using a medium-sized lexicon (1950 words) corresponding to the vocabulary of the test set, and 63.8% using a large, general-purpose lexicon (130,000 words). With the proposed stem+ending lexicon (12,500 words) and a bigram language model with lattice expansion, a 67.9% word recognition accuracy is obtained, surpassing the result obtained with the general-purpose lexicon while using a much smaller lexicon.
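To make the stem+ending idea concrete, the following is a minimal sketch of a bigram language model over sublexical units. It is not the authors' implementation: the toy corpus, its hand-made segmentations, the `+` marker for endings, the add-one smoothing, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each word is a hand-segmented (stem, ending) pair.
# An empty ending means the word is a bare stem. None of this data is from the paper.
corpus = [
    [("ev", "ler"), ("güzel", "")],
    [("ev", "de"), ("kitap", "lar")],
    [("kitap", "lar"), ("güzel", "")],
]

def to_units(sentence):
    """Flatten a sentence into a unit sequence; endings carry a '+' prefix."""
    units = ["<s>"]
    for stem, ending in sentence:
        units.append(stem)
        if ending:
            units.append("+" + ending)
    units.append("</s>")
    return units

# Collect context (unigram) and bigram counts over the unit sequences.
context_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    units = to_units(sentence)
    context_counts.update(units[:-1])
    bigram_counts.update(zip(units, units[1:]))

vocab = {unit for sentence in corpus for unit in to_units(sentence)}
stems = {stem for sentence in corpus for stem, _ in sentence}
endings = {ending for sentence in corpus for _, ending in sentence}

def in_unit_lexicon(stem, ending):
    """A word is covered if its stem and ending are both known units."""
    return stem in stems and ending in endings

def bigram_logprob(prev, cur):
    """Add-one smoothed log P(cur | prev); the smoothing choice is an assumption."""
    return math.log((bigram_counts[(prev, cur)] + 1) /
                    (context_counts[prev] + len(vocab)))

def score(sentence):
    """Log-probability of a segmented sentence under the unit bigram model."""
    units = to_units(sentence)
    return sum(bigram_logprob(p, c) for p, c in zip(units, units[1:]))

seen = [("ev", "ler"), ("güzel", "")]
unseen = [("ev", "lar"), ("güzel", "")]
print(in_unit_lexicon("ev", "lar"), score(seen) > score(unseen))
```

In this sketch the unit lexicon covers the combination `ev` + `lar` even though that word form never occurs in the toy corpus, which illustrates how a small unit inventory can reach word forms a fixed word lexicon would miss. The bigram scores then rank the observed sequence `ev` + `ler` above the unobserved one, which is the role a unit-level language model can play in limiting overgeneration.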
Item Type: | Article |
---|---|
Uncontrolled Keywords: | Online handwriting recognition; Turkish handwriting recognition; hidden Markov models; statistical language modeling; UNIPEN; grammatical sublexical units; delayed strokes |
Divisions: | Faculty of Engineering and Natural Sciences > Academic programs > Computer Science & Eng.; Faculty of Engineering and Natural Sciences |
Depositing User: | Berrin Yanıkoğlu |
Date Deposited: | 23 Jan 2019 15:32 |
Last Modified: | 06 Jun 2023 12:25 |
URI: | https://research.sabanciuniv.edu/id/eprint/36813 |