Efficient top-k similarity document search utilizing distributed file systems and cosine similarity

Alewiwi, Mahmoud Khaled and Örencik, Cengiz and Savaş, Erkay (2016) Efficient top-k similarity document search utilizing distributed file systems and cosine similarity. Cluster Computing, 19 (1). pp. 109-126. ISSN 1386-7857 (Print) 1573-7543 (Online)



Official URL: http://dx.doi.org/10.1007/s10586-015-0506-0


Document similarity has important real-life applications, such as finding duplicate web sites and identifying plagiarism. While basic techniques such as k-similarity algorithms have long been known, the overwhelming amount of data being collected, as in big data settings, calls for novel algorithms that find highly similar documents in a reasonably short amount of time. In particular, pairwise comparison of documents' features, a key operation in calculating document similarity, requires prohibitively large storage and computation power. In this paper, we propose a new filtering technique that decreases the number of comparisons between the query set and the search set needed to find highly similar documents. The proposed filtering technique utilizes a Z-order prefix, based on the cosine similarity measure, in which only the most important features are used first to find highly similar documents. We propose a three-phase approach consisting of near-duplicate detection, common important terms, and a join phase. We utilize the Hadoop distributed file system and the MapReduce parallel programming model to scale our techniques to big data settings. Our experimental results on real data show that the proposed method outperforms previous work in the literature in the number of joins and, therefore, in speed.
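As a rough illustration of the general idea in the abstract (score candidates by cosine similarity, but compare only documents that share important terms with the query), here is a minimal single-machine sketch. It is not the paper's Z-order prefix technique or its MapReduce implementation. The sparse `{term: weight}` representation, the helper names `cosine`, `important_terms` and `top_k_similar`, and the parameter `m` are all assumptions made for this example. The filter is also lossy: a similar document whose top-weighted terms differ from the query's would be skipped.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def important_terms(vec, m):
    """Hypothetical helper: the m highest-weighted terms of a sparse vector."""
    return set(heapq.nlargest(m, vec, key=vec.get))


def top_k_similar(query, corpus, k, m=2):
    """Return up to k (score, doc_id) pairs, best first.

    Assumed filter: a document is compared with the query only if the two
    share at least one of their m most important terms.
    """
    q_terms = important_terms(query, m)
    candidates = (
        (doc_id, vec)
        for doc_id, vec in corpus.items()
        if q_terms & important_terms(vec, m)
    )
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in candidates)
    return heapq.nlargest(k, scored)
```

Calling `top_k_similar(query, corpus, k)` with `corpus` as a dict mapping document ids to term-weight dicts returns the best-scoring candidates that survive the filter. Raising `m` admits more candidates at the cost of more comparisons.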

Item Type: Article
Uncontrolled Keywords: Z-order; Document similarity; MapReduce; Hadoop; Cosine similarity
Subjects: Q Science > QA Mathematics > QA075 Electronic computers. Computer science; Q Science > QA Mathematics > QA076 Computer software
ID Code: 30189
Deposited By: Erkay Savaş
Deposited On: 11 Nov 2016 22:41
Last Modified: 11 Nov 2016 22:41

