Impact assessment and prediction of tweets and topics

Arın, İnanç (2017) Impact assessment and prediction of tweets and topics. [Thesis]


People tend to spread information and share their ideas on Twitter, while researchers and policy makers would like to understand public opinion and people's reactions on Twitter to various events. One way to do that is to assess and predict the impact of tweets. In this thesis, we tried to answer three questions: (1) "What does impact of a tweet mean?", (2) "How do we measure the impact of tweets or topics?", and (3) "Can we predict the impact of tweets or topics?". To address these questions, we first emphasize the role of retweets and their importance in impact assessment. We then show that we can build a model through supervised learning to predict whether a tweet will get a high number of retweets. We extracted various features from tweets, including content-based features obtained through Convolutional Neural Networks (CNN). To make the impact assessment more accurate, we introduced the concept of hidden retweets. People tend to re-post tweets by adding extra comments to the beginning or the end of the original tweet. They also, intentionally or unintentionally, post tweets that are identical or nearly identical to other people's tweets without explicitly retweeting them. Hidden retweets are therefore quite important for measuring the real impact of tweets. However, identifying and counting hidden retweets is computationally expensive. We show that hidden retweets can be aggregated efficiently through a lexical-similarity-based clustering algorithm enhanced with a tree-structured index and locality sensitive hashing. We adopted a document-clustering-based approach for discovering hidden retweets, and we developed and evaluated several clustering algorithms that use lexical similarity as the distance measure between tweets. Longest Common Subsequence (LCS) is a widely accepted method for calculating the lexical similarity between short text documents such as tweets, but it is also very expensive. Therefore, we utilized an advanced data structure, a Generalized Suffix Tree (GST) built around the Longest Common Substring, which is an approximation of LCS. We then developed a density-based clustering approach for tweet clustering and improved its performance by integrating GST and Locality Sensitive Hashing.
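The difference between the two lexical measures named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: it uses plain dynamic programming rather than a Generalized Suffix Tree, and the function names and the normalization by the shorter text's length are assumptions made only for this illustration.

```python
def lcs_subsequence_len(a: str, b: str) -> int:
    """Length of the longest common subsequence (matches may be non-adjacent)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest common substring (matches must be contiguous)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0] * (len(b) + 1)
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def similarity(a: str, b: str, measure) -> float:
    """Hypothetical normalization: shared length divided by the shorter text."""
    if not a or not b:
        return 0.0
    return measure(a, b) / min(len(a), len(b))


# A re-post that prepends a comment to the original tweet (a "hidden retweet").
original = "big news today"
repost = "wow! big news today"
print(similarity(original, repost, longest_common_substring_len))
print(similarity(original, repost, lcs_subsequence_len))
```

Because a common substring is also a common subsequence, the substring length never exceeds the subsequence length, which is consistent with treating it as a cheaper approximation. Both functions here still take time proportional to the product of the two lengths per pair; the sketch shows only what is being measured, not how an index makes it efficient.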
Item Type: Thesis
Additional Information: Yükseköğretim Kurulu Tez Merkezi (Council of Higher Education Thesis Center), Thesis No: 478661.
Uncontrolled Keywords: Impact prediction. -- Hidden retweets. -- Tweet clustering. -- Lexical based clustering. -- Density based clustering. -- Generalized suffix tree. -- Locality sensitive hashing.
Subjects: T Technology > TK Electrical engineering. Electronics. Nuclear engineering > TK7800-8360 Electronics > TK7885-7895 Computer engineering. Computer hardware
Divisions: Faculty of Engineering and Natural Sciences > Academic programs > Computer Science & Eng.
Faculty of Engineering and Natural Sciences
Depositing User: IC-Cataloging
Date Deposited: 07 May 2018 15:06
Last Modified: 26 Apr 2022 10:21
