Impact assessment and prediction of tweets and topics
Arın, İnanç (2017) Impact assessment and prediction of tweets and topics. [Thesis]
People tend to spread information and share their ideas in Twitter, while researchers and policy makers would like to understand public opinion and reactions of people in Twitter towards various events. One way to do that is assessing and predicting the impact of tweets. In this thesis, we tried to answer three questions: (1) "What does impact of a tweet mean?", (2) "How do we measure the impact of tweets or topics?", and (3) "Can we predict the impact of tweets or topics?". In order to address these questions, we first emphasize the role of retweets and their importance in impact assessment. We then show that we can build a model through supervised learning to predict if a tweet will get a high number of retweets. We extracted various features from tweets including content based features through Convolutional Neural Networks (CNN). In order to have a more accurate impact assessment, we introduced the concept of hidden retweets. People tend to re-post tweets by adding some extra comments to the beginning or to the end of original tweet. Also they intentionally or unintentionally post the exact or near exact tweets with other people without explicitly retweeting them. Therefore hidden retweets are quite important for measuring the real impact of tweets. However, it is also computationally expensive to identify and count the number of hidden retweets. We show that aggregating hidden retweets can be done efficiently through a lexical similarity based clustering algorithm enhanced with a tree structured index and locality sensitive hashing. We adopted a document clustering based approach for discovering the hidden retweets. We developed and evaluated several clustering algorithms with lexical similarity as the distance measure between tweets. Longest Common Subsequence (LCS) is a widely accepted method to calculate the lexical similarity between short text documents such as tweets, but it is also very expensive. Therefore, we utilized an advanced data structure which is Generalized Suffix Tree (GST) based on Longest Common Substring which is an approximation of LCS. We, then developed a density based clustering approach based for tweet clustering and improved its performance by integrating GST and Locality Sensitive Hashing.
Repository Staff Only: item control page