Najafi, Ali (2024) Developing Turkish language models on social media. [Thesis]

Abstract
Turkish is one of the most spoken languages in the world; however, it is still among the low-resource languages. Wide use of this language on social media platforms such as Twitter, Instagram, or TikTok, together with the strategic position of the country in world politics, makes it appealing for social network researchers and industry. To address this need, we introduce TurkishBERTweet, the first large-scale pre-trained language model for Turkish social media, built using over 894 million Turkish tweets. The model shares the same architecture as the RoBERTa-base model but with a smaller input length, making TurkishBERTweet lighter than the most widely used model, BERTurk, and able to achieve significantly lower inference time. We trained our model using the same approach as the RoBERTa model and evaluated it on two tasks: Sentiment Classification and Hate Speech Detection. We demonstrate that TurkishBERTweet outperforms the other available alternatives on generalizability and that its lower inference time gives a significant advantage when processing large-scale datasets. We also show that custom preprocessors for social media can acquire information from platform-specific entities. Finally, we compare TurkishBERTweet with commercial solutions such as OpenAI and Gemini, and with other available Turkish LLMs, in terms of cost and performance to demonstrate that it is scalable and cost-effective.
Item Type: | Thesis |
---|---|
Uncontrolled Keywords: | TurkishBERTweet, Sentiment Analysis, Hate Speech Detection, ChatGPT, Special Tokenizer. -- TurkishBERTweet, Duygu Analizi, Nefret Söylemi Tespiti, ChatGPT, Special Tokenizer. |
Subjects: | T Technology > TK Electrical engineering. Electronics. Nuclear engineering > TK7800-8360 Electronics > TK7885-7895 Computer engineering. Computer hardware |
Divisions: | Faculty of Engineering and Natural Sciences > Academic programs > Computer Science & Eng.; Faculty of Engineering and Natural Sciences |
Depositing User: | Dila Günay |
Date Deposited: | 18 Apr 2025 14:28 |
Last Modified: | 18 Apr 2025 14:46 |
URI: | https://research.sabanciuniv.edu/id/eprint/51710 |