Şen, Mehmet Umut and Bilecen, Ali Enver and Bilgin Tasdemir, Esma Fatima and Yanıkoğlu, Berrin (2025) Transcription of Ottoman documents using transformer based models. In: 33rd Signal Processing and Communications Applications Conference (SIU), Istanbul, Turkiye
Full text not available from this repository.
Official URL: https://dx.doi.org/10.1109/SIU66497.2025.11112382
Abstract
Although large numbers of Ottoman documents are now easier to access, the Arabic-Persian-based Ottoman script remains a barrier for users who want to work with them. Automatic transcription systems are needed to address this challenge. Some commercial and academic deep learning models exist for Ottoman transcription, but no studies have yet explored models based on transformer architectures. This paper introduces an Ottoman transcription system built on TrOCR, a transformer-based model. Instead of the two-step approach common in the literature, the model performs optical character recognition and transcription into Turkish in a single step. In addition, the decoder, which is responsible for language modeling, was initialized with a BERT-based model trained on Turkish data; this variant achieved results comparable to those of the original model. At test time, it also produced outputs more quickly, owing to improved tokenization performance.
| Item Type: | Papers in Conference Proceedings |
|---|---|
| Uncontrolled Keywords: | BERT; BERTurk; Optical Character Recognition; Ottoman; transcription; transformer models; TrOCR |
| Divisions: | Faculty of Engineering and Natural Sciences > Academic programs > Computer Science & Eng.; Faculty of Engineering and Natural Sciences |
| Depositing User: | Berrin Yanıkoğlu |
| Date Deposited: | 29 Sep 2025 15:35 |
| Last Modified: | 29 Sep 2025 15:35 |
| URI: | https://research.sabanciuniv.edu/id/eprint/52549 |