Automatic transcription of Ottoman documents using deep learning

Tasdemir, Esma F. Bilgin and Tandoğan, Zeynep and Akansu, S. Doğan and Kızılırmak, Fırat and Şen, Mehmet Umut and Akcan, Aysu and Kuru, Mehmet and Yanıkoğlu, Berrin (2024) Automatic transcription of Ottoman documents using deep learning. In: 16th IAPR International Workshop on Document Analysis Systems, DAS 2024, Athens, Greece

Full text not available from this repository.

Abstract

With the accelerated pace of digitization, a vast collection of Ottoman documents has become accessible to researchers and the general public. However, most users interested in these documents are unable to read them, as the text is Turkish written in the Arabic-Persian script. Manual transcription of such a massive number of documents is also beyond the capacity of human experts. With the advancements in deep learning, we have been able to provide a solution to the long-standing problem of automatic transcription of printed Ottoman documents. We evaluated three decoding strategies, including Word Beam Search, which allows a recognition lexicon and n-gram statistics to be used during the decoding phase. We also evaluated the effect of lexicon size and coverage, and of language modelling via character or word n-grams. Using a general-purpose large lexicon of the Ottoman era (260K words and 86% test coverage), the measured performance is a 6.59% character error rate and a 28.46% word error rate on a test set of 6,828 text lines.
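For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how character error rate, word error rate, and lexicon coverage are commonly computed. It is not the authors' evaluation code. The function names, the toy Latin-script strings, the choice to count spaces as characters, and the choice to measure coverage over word tokens rather than unique words are all assumptions made for illustration.

```python
from typing import Iterable, List, Sequence, Set


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs: List[str], hyps: List[str], unit: str = "char") -> float:
    """Total edits divided by total reference length, over all text lines.

    unit="char" gives a character error rate (spaces counted, by assumption);
    unit="word" gives a word error rate on whitespace-separated tokens.
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        r = list(ref) if unit == "char" else ref.split()
        h = list(hyp) if unit == "char" else hyp.split()
        total_edits += edit_distance(r, h)
        total_len += len(r)
    if total_len == 0:
        raise ValueError("reference text is empty")
    return total_edits / total_len


def lexicon_coverage(refs: Iterable[str], lexicon: Set[str]) -> float:
    """Fraction of reference word tokens that appear in the lexicon."""
    words = [w for line in refs for w in line.split()]
    if not words:
        raise ValueError("reference text is empty")
    return sum(w in lexicon for w in words) / len(words)


# Hypothetical toy data, not taken from the paper's test set.
refs = ["kitap okudum"]
hyps = ["kitab okudum"]
cer = error_rate(refs, hyps, unit="char")    # 1 edit over 12 characters
wer = error_rate(refs, hyps, unit="word")    # 1 wrong word out of 2
coverage = lexicon_coverage(refs, {"kitap"})  # 1 of 2 words in the lexicon
```

Under these assumptions, a single wrong character makes its whole word count as an error, which is one reason a word error rate is typically much higher than the corresponding character error rate.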
Item Type: Papers in Conference Proceedings
Uncontrolled Keywords: Deep Learning; Ottoman Document Recognition; Turkish
Divisions: Faculty of Arts and Social Sciences; Faculty of Engineering and Natural Sciences
Depositing User: Berrin Yanıkoğlu
Date Deposited: 01 Oct 2024 14:59
Last Modified: 01 Oct 2024 14:59
URI: https://research.sabanciuniv.edu/id/eprint/50453
