Tweets on a tree: index-based clustering of tweets

Erpam, Mert Kemal (2017) Tweets on a tree: index-based clustering of tweets. [Thesis]

Abstract

Computer-mediated communication (CMC) is a type of communication that occurs through the use of two or more electronic devices. With the advancement of technology, CMC has become an increasingly preferred form of communication between humans. Computer-mediated technologies have given rise to news portals, search engines, and social media platforms such as Facebook, Twitter, and Reddit. On social media platforms, users can post and discuss their own opinions and also read and share other users' opinions. This generates a significant amount of data which, if filtered and analyzed, can give researchers important insights about public opinion and culture. Twitter is a social networking service that was founded in 2006 and became widespread throughout the world in a very short time. As of 2016, the service has more than 310 million monthly active users, who generate more than 500 million tweets daily. Due to the volume, velocity, and variety of Twitter data, it cannot be analyzed using conventional methods. A clustering or sampling method is necessary to reduce the amount of data for analysis. To cluster documents, in a very broad sense, two similarity measures can be used: lexical similarity and semantic similarity. Lexical similarity looks for syntactic similarity between documents. It is usually computationally light to compute; however, for clustering purposes it may not be very accurate, as it disregards the semantic value of words. Semantic similarity, on the other hand, uses the semantic value of words and the relations between them to calculate similarity. While it is generally more accurate than lexical similarity, it is computationally difficult to calculate. In our work, we aim to create a computationally light and accurate clustering of short documents which have the characteristics of big data.
We propose a hybrid approach to clustering in which lexical and semantic similarity are combined. In our approach, we use string similarity to create clusters and semantic vector representations of words to interactively merge clusters.
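As a rough illustration of the two-stage idea described in the abstract, the following Python sketch first groups short texts by a cheap lexical measure and then merges the resulting clusters by comparing averaged word vectors. It is not the thesis's implementation: Jaccard overlap of token sets stands in for the string similarity, the merge step is automatic rather than interactive, and the names `TOY_VECTORS`, `lexical_clusters`, and `merge_semantic`, the two-dimensional vectors, and both thresholds are assumptions made only for this example.

```python
from math import sqrt

# Hypothetical toy word vectors (assumption); a real system would load
# trained word embeddings instead of these hand-written values.
TOY_VECTORS = {
    "goal": (1.0, 0.0),
    "match": (0.9, 0.1),
    "football": (0.95, 0.05),
    "soccer": (0.95, 0.05),
    "election": (0.0, 1.0),
    "vote": (0.1, 0.9),
}


def jaccard(a, b):
    """Lexical similarity: size of the token overlap divided by the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lexical_clusters(tweets, threshold=0.5):
    """Greedy pass: a tweet joins the first cluster whose seed is lexically close."""
    clusters = []
    for text in tweets:
        tokens = set(text.lower().split())
        for cluster in clusters:
            if jaccard(tokens, cluster["seed"]) >= threshold:
                cluster["members"].append(text)
                cluster["tokens"] |= tokens
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append({"seed": tokens, "tokens": set(tokens), "members": [text]})
    return clusters


def centroid(tokens):
    """Average the vectors of the tokens that have one; None if there are none."""
    vecs = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(dim))


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def merge_semantic(clusters, threshold=0.9):
    """Merge clusters whose token centroids point in nearly the same direction."""
    merged = []
    for cluster in clusters:
        c_vec = centroid(cluster["tokens"])
        for target in merged:
            t_vec = centroid(target["tokens"])
            if c_vec is not None and t_vec is not None and cosine(c_vec, t_vec) >= threshold:
                target["members"].extend(cluster["members"])
                target["tokens"] |= cluster["tokens"]
                break
        else:
            merged.append({"tokens": set(cluster["tokens"]), "members": list(cluster["members"])})
    return merged
```

In this sketch, tweets such as "great football match" and "soccer goal" share no tokens, so the lexical pass keeps them apart; the semantic pass can still merge their clusters because the toy vectors for "football" and "soccer" are close. The expensive vector comparison runs only once per cluster pair rather than once per tweet pair, which is the cost saving a hybrid scheme of this kind aims for.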
Item Type: Thesis
Additional Information: Yükseköğretim Kurulu Tez Merkezi Tez No: 459029.
Uncontrolled Keywords: Clustering. -- Twitter. -- Summarization. -- Suffix tree. -- Semantic relatedness. -- Data mining. -- Kümelemek. -- Twitter. -- Sonek ağacı. -- Anlamsal İlişkililik. -- Veri madenciliği.
Subjects: T Technology > TK Electrical engineering. Electronics. Nuclear engineering > TK7800-8360 Electronics > TK7885-7895 Computer engineering. Computer hardware
Divisions: Faculty of Engineering and Natural Sciences > Academic programs > Computer Science & Eng.
Faculty of Engineering and Natural Sciences
Depositing User: IC-Cataloging
Date Deposited: 13 Apr 2018 09:28
Last Modified: 26 Apr 2022 10:16
URI: https://research.sabanciuniv.edu/id/eprint/34418
