TurkishBERTweet: fast and reliable large language model for social media analysis

Najafi, Ali and Varol, Onur (2024) TurkishBERTweet: fast and reliable large language model for social media analysis. Expert Systems with Applications, 255 (Part C). ISSN 0957-4174 (Print) 1873-6793 (Online)

Full text not available from this repository.

Abstract

Turkish is one of the most widely spoken languages in the world; however, it remains among the low-resource languages. The wide use of Turkish on social media platforms such as Twitter, Instagram, and TikTok, together with the country's strategic position in world politics, makes it appealing to social network researchers and to industry. To address this need, we introduce TurkishBERTweet, the first large-scale pre-trained language model for Turkish social media, built using over 894 million Turkish tweets. The model shares the same architecture as the RoBERTa-base model but with a smaller input length, making TurkishBERTweet lighter than the most widely used model, BERTurk, with significantly lower inference time. We trained our model using the same approach as RoBERTa and evaluated it on two tasks: Sentiment Classification and Hate Speech Detection. We demonstrate that TurkishBERTweet outperforms the available alternatives in generalizability, and its lower inference time gives it a significant advantage when processing large-scale datasets. We also show that custom preprocessors for social media can extract information from platform-specific entities. Finally, we compare TurkishBERTweet with commercial solutions such as OpenAI and Gemini, as well as other available Turkish LLMs, in terms of cost and performance, demonstrating that TurkishBERTweet is scalable and cost-effective.
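As a rough illustration of the workflow the abstract describes (entity-aware preprocessing followed by classification), here is a minimal Python sketch using the Hugging Face transformers library. The hub ID "VRLLab/TurkishBERTweet", the placeholder tokens, and the two-label classification head are assumptions made for illustration and are not taken from this record; the actual preprocessor and checkpoints are described in the paper itself.

```python
# Minimal sketch: BERTweet-style preprocessing + sequence classification.
# Assumptions (not confirmed by this record): hub ID "VRLLab/TurkishBERTweet",
# placeholder tokens <user>/<http>/<hashtag>, and a 2-label sentiment head.
import re

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def preprocess(tweet: str) -> str:
    """Replace platform-specific entities with placeholder tokens
    (token names here are illustrative, not the authors' exact choices)."""
    tweet = re.sub(r"https?://\S+", "<http>", tweet)    # URLs
    tweet = re.sub(r"@\w+", "<user>", tweet)            # mentions
    tweet = re.sub(r"#(\w+)", r"<hashtag> \1", tweet)   # hashtags
    return tweet

tokenizer = AutoTokenizer.from_pretrained("VRLLab/TurkishBERTweet")
# num_labels=2 attaches a freshly initialized head; it would need fine-tuning
# on a task such as Sentiment Classification before its outputs are meaningful.
model = AutoModelForSequenceClassification.from_pretrained(
    "VRLLab/TurkishBERTweet", num_labels=2
)

text = preprocess("@kullanici bu film harikaydi! https://example.com #sinema")
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))
```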
Item Type: Article
Divisions: Center of Excellence in Data Analytics; Faculty of Engineering and Natural Sciences
Depositing User: Onur Varol
Date Deposited: 29 Aug 2024 15:34
Last Modified: 29 Aug 2024 15:34
URI: https://research.sabanciuniv.edu/id/eprint/49635
