Turkish dataset for semantic textual similarity [Anlamsal metin benzerliği için Türkçe veri kümesi]

Beken Fikri, Figen and Oflazer, Kemal and Yanıkoğlu, Berrin (2021) Turkish dataset for semantic textual similarity [Anlamsal metin benzerliği için Türkçe veri kümesi]. In: 29th Signal Processing and Communications Applications Conference (SIU), Istanbul, Turkey

Full text not available from this repository.


Semantic textual similarity is the task of determining how similar two texts are. In this study, we present the first Turkish evaluation benchmark dataset for semantic textual similarity. We created the dataset (STSb-TR) by translating the English STS benchmark (STSb) dataset via the Google Cloud Translation API, and we report benchmark results for several models. We used Language-Agnostic SEntence Representations (LASER), Language-agnostic BERT Sentence Embedding (LaBSE), Multilingual Universal Sentence Encoder (MUSE) and pre-trained BERT/RoBERTa models to compute sentence embeddings. We also fine-tuned pre-trained BERT/RoBERTa models to compute similarity scores directly, and we further fine-tuned pre-trained BERT/RoBERTa models with the SBERT architecture. In our experimental designs, we used the Turkish natural language inference (NLI-TR) dataset as well. Model performance was measured by the Pearson and Spearman correlation coefficients between the predicted similarity scores and the gold labels. The best results were obtained by fine-tuning the BERTurk model first on the NLI-TR dataset, then on the STSb-TR dataset.
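The evaluation protocol described in the abstract (score each sentence pair, then correlate the predictions with the gold labels) can be illustrated with a minimal sketch. This is not the authors' code: the two-dimensional vectors and the gold scores are made-up toy values standing in for real sentence embeddings and STSb-TR labels, cosine similarity is assumed as the scoring function for the embedding-based models, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (assumption): one embedding per sentence in each pair, and a
# gold similarity score on a 0-5 scale for the pair.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 1.0], [1.0, 0.9]),
]
gold = [4.8, 2.5, 0.2, 4.5]

predicted = [cosine_similarity(u, v) for u, v in pairs]
print("Pearson: ", round(pearson(predicted, gold), 4))
print("Spearman:", round(spearman(predicted, gold), 4))
```

Pearson measures linear agreement between predicted and gold scores, while Spearman depends only on their rank order, so reporting both shows whether a model orders pairs correctly even when its raw scores are not on the gold scale.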
Item Type: Papers in Conference Proceedings
Divisions: Faculty of Engineering and Natural Sciences > Academic programs > Computer Science & Eng.
Faculty of Engineering and Natural Sciences
Depositing User: Berrin Yanıkoğlu
Date Deposited: 03 Sep 2021 16:11
Last Modified: 19 Aug 2022 15:23
URI: https://research.sabanciuniv.edu/id/eprint/42387
