Transcription of Ottoman documents using transformer based models

Şen, Mehmet Umut and Bilecen, Ali Enver and Bilgin Tasdemir, Esma Fatima and Yanıkoğlu, Berrin (2025) Transcription of Ottoman documents using transformer based models. In: 33rd Signal Processing and Communications Applications Conference (SIU), Istanbul, Turkiye

Full text not available from this repository.

Official URL: https://dx.doi.org/10.1109/SIU66497.2025.11112382

Abstract

Although a large number of Ottoman documents are now easily accessible, the Arabic-Persian-based Ottoman script remains a barrier for users who want to work with them, which creates a need for automatic transcription systems. While some deep learning-based commercial and academic models exist for Ottoman transcription, no studies have yet explored models based on transformer architectures. This paper introduces an Ottoman transcription system built on TrOCR, a transformer-based model. Instead of the two-step approach common in the literature, the model performs both optical character recognition and transcription into Turkish in a single step. Additionally, the decoder, which is responsible for language modeling, was initialized with a BERT-based model trained on Turkish data; this variant achieved results comparable to those of the original model. During testing, it also produced outputs more quickly due to improved tokenization performance.
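The abstract links faster output to tokenization. One plausible reading, stated here as an assumption and not as the paper's own analysis, is that a vocabulary trained on Turkish splits the same text into fewer subword tokens, so an autoregressive decoder needs fewer generation steps per line. The toy sketch below illustrates only that counting argument. The two vocabularies, the sample phrase, and the helper names `greedy_tokenize` and `count_tokens` are hypothetical and invented for illustration; they are not taken from TrOCR, BERTurk, or the paper.

```python
def greedy_tokenize(word, vocab):
    """Split a word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No entry matches: fall back to a single-character piece.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


def count_tokens(text, vocab):
    """Total number of subword tokens for a whitespace-separated text."""
    return sum(len(greedy_tokenize(word, vocab)) for word in text.split())


# Hypothetical vocabularies (assumption): a generic one with short pieces,
# and a Turkish-aware one that also contains longer stems and suffixes.
GENERIC_VOCAB = {"ki", "ta", "p", "lar", "da", "n", "ev", "le", "r", "i", "m", "iz"}
TURKISH_VOCAB = GENERIC_VOCAB | {"kitap", "lardan", "evler", "imiz"}

sentence = "kitaplardan evlerimiz"
generic_steps = count_tokens(sentence, GENERIC_VOCAB)
turkish_steps = count_tokens(sentence, TURKISH_VOCAB)

print(greedy_tokenize("kitaplardan", GENERIC_VOCAB))
print(greedy_tokenize("kitaplardan", TURKISH_VOCAB))
print(generic_steps, turkish_steps)
```

With these invented vocabularies the Turkish-aware one yields fewer tokens for the same phrase. Since a decoder emits one token per step, a shorter token sequence means fewer steps; whether this is the mechanism behind the speedup reported in the paper is not confirmed by the abstract.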

Item Type:	Papers in Conference Proceedings
Uncontrolled Keywords:	BERT; BERTurk; Optical Character Recognition; Ottoman; transcription; transformer models; TrOCR
Divisions:	Faculty of Engineering and Natural Sciences > Academic programs > Computer Science & Eng.; Faculty of Engineering and Natural Sciences
Depositing User:	Berrin Yanıkoğlu
Date Deposited:	29 Sep 2025 15:35
Last Modified:	08 Oct 2025 10:16
URI:	https://research.sabanciuniv.edu/id/eprint/52549
