Najafi, Ali (2024) Developing Turkish Language Models On Social Media. [Thesis]

Abstract
Turkish is one of the most widely spoken languages in the world; however, it is still
considered a low-resource language. Its wide use on social media platforms
such as Twitter, Instagram, and TikTok, together with the country's strategic position in
world politics, makes it appealing to social network researchers and industry. To
address this need, we introduce TurkishBERTweet, the first large-scale pre-trained
language model for Turkish social media, built using over 894 million Turkish tweets.
The model shares the architecture of the RoBERTa-base model but uses a smaller input
length, which makes TurkishBERTweet lighter than BERTurk, the most widely used
model, and allows significantly lower inference time. We trained our model
using the same approach as the RoBERTa model and evaluated it on two tasks: Sentiment
Classification and Hate Speech Detection. We demonstrate that TurkishBERTweet
outperforms the available alternatives in generalizability, and that its lower inference
time is a significant advantage when processing large-scale datasets. We also show that
custom preprocessors for social media can extract information from platform-specific
entities. Finally, we compare the model with commercial solutions such as OpenAI
and Gemini, and with other available Turkish LLMs, in terms of cost and performance to
demonstrate that TurkishBERTweet is scalable and cost-effective.
Item Type: | Thesis |
---|---|
Uncontrolled Keywords: | TurkishBERTweet, Sentiment Analysis, Hate Speech Detection, ChatGPT, Special Tokenizer -- TurkishBERTweet, Duygu Analizi, Nefret Söylemi Tespiti, ChatGPT, Special Tokenizer |
Subjects: | T Technology > TK Electrical engineering. Electronics. Nuclear engineering > TK7800-8360 Electronics > TK7885-7895 Computer engineering. Computer hardware |
Divisions: | Faculty of Engineering and Natural Sciences > Academic programs > Computer Science & Eng.; Faculty of Engineering and Natural Sciences |
Depositing User: | Dila Günay |
Date Deposited: | 18 Feb 2025 13:34 |
Last Modified: | 18 Feb 2025 14:09 |
URI: | https://research.sabanciuniv.edu/id/eprint/51396 |