Tasdemir, Esma F. Bilgin and Tandoğan, Zeynep and Akansu, S. Doğan and Kızılırmak, Fırat and Şen, Mehmet Umut and Akcan, Aysu and Kuru, Mehmet and Yanıkoğlu, Berrin (2024) Automatic transcription of Ottoman documents using deep learning. In: 16th IAPR International Workshop on Document Analysis Systems, DAS 2024, Athens, Greece
Full text not available from this repository.
Official URL: https://dx.doi.org/10.1007/978-3-031-70442-0_26
Abstract
With the accelerated pace of digitization, a vast collection of Ottoman documents has become accessible to researchers and the general public. However, most users interested in these documents are unable to read them, as the text is Turkish written in the Arabic-Persian script. Manual transcription of such a massive number of documents is also beyond the capacity of human experts. With the advancements in deep learning, we have been able to provide a solution to the long-standing problem of automatic transcription of printed Ottoman documents. We evaluated three decoding strategies, including Word Beam Search, which allows the use of a recognition lexicon and n-gram statistics during the decoding phase. Furthermore, the effects of lexicon size and coverage and of language modelling via character or word n-grams are also evaluated. Using a general-purpose large lexicon of the Ottoman era (260K words and 86% test coverage), the performance is measured as 6.59% character error rate and 28.46% word error rate on a test set of 6,828 text lines.
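For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how character error rate, word error rate, and lexicon coverage are commonly computed. It is not the authors' evaluation code. The function names (`edit_distance`, `error_rate`, `lexicon_coverage`) are hypothetical, and both the whitespace tokenization and the token-level definition of coverage are assumptions that may differ from the paper's.

```python
from typing import Iterable, List, Sequence, Set


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(refs: List[str], hyps: List[str], unit: str = "char") -> float:
    """Total edit distance divided by total reference length, over all lines.

    unit="char" gives a character error rate; unit="word" gives a word error
    rate (assumption: words are split on whitespace).
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(r, h)
        total_len += len(r)
    if total_len == 0:
        raise ValueError("reference text is empty")
    return total_edits / total_len


def lexicon_coverage(lines: Iterable[str], lexicon: Set[str]) -> float:
    """Fraction of whitespace-separated tokens in `lines` that appear in `lexicon`."""
    tokens = [w for line in lines for w in line.split()]
    if not tokens:
        raise ValueError("no tokens to measure coverage on")
    return sum(w in lexicon for w in tokens) / len(tokens)


# Toy example (made-up strings, not data from the paper).
refs = ["kitap okudum"]
hyps = ["kitab okudum"]
cer = error_rate(refs, hyps, unit="char")      # 1 substitution over 12 characters
wer = error_rate(refs, hyps, unit="word")      # 1 wrong word out of 2
coverage = lexicon_coverage(refs, {"kitap"})   # 1 of 2 tokens is in the lexicon
```

The toy example also shows why a word error rate is typically much higher than the corresponding character error rate: a single wrong character makes the whole word count as an error.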
Item Type: | Papers in Conference Proceedings |
---|---|
Uncontrolled Keywords: | Deep Learning; Ottoman Document Recognition; Turkish |
Divisions: | Faculty of Arts and Social Sciences; Faculty of Engineering and Natural Sciences |
Depositing User: | Berrin Yanıkoğlu |
Date Deposited: | 01 Oct 2024 14:59 |
Last Modified: | 01 Oct 2024 14:59 |
URI: | https://research.sabanciuniv.edu/id/eprint/50453 |