Efficient top-k similarity document search utilizing distributed file systems and cosine similarity

Alewiwi, Mahmoud Khaled and Örencik, Cengiz and Savaş, Erkay (2016) Efficient top-k similarity document search utilizing distributed file systems and cosine similarity. Cluster Computing, 19 (1). pp. 109-126. ISSN 1386-7857 (Print) 1573-7543 (Online)

This is the latest version of this item.

Official URL: http://dx.doi.org/10.1007/s10586-015-0506-0

Abstract

Document similarity has important real-life applications, such as finding duplicate web sites and identifying plagiarism. While basic techniques such as k-similarity algorithms have long been known, the overwhelming amount of data being collected, as in big data settings, calls for novel algorithms that find highly similar documents in a reasonably short amount of time. In particular, pairwise comparison of documents' features, a key operation in calculating document similarity, necessitates prohibitively high storage and computation power. In this paper, we propose a new filtering technique that decreases the number of comparisons between the query set and the search set needed to find highly similar documents. The proposed filtering technique utilizes a Z-order prefix, based on the cosine similarity measure, in which only the most important features are used first to find highly similar documents. We propose a three-phase approach, where the phases are near-duplicate detection, common important terms, and join. We utilize the Hadoop distributed file system and the MapReduce parallel programming model to scale our techniques to big data settings. Our experimental results on real data show that the proposed method performs better than previous work in the literature in terms of the number of joins and, therefore, speed.
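As a rough illustration of the general idea in the abstract (filter candidate pairs on their most important features first, then compute cosine similarity only for the survivors), a minimal single-machine sketch follows. It is not the paper's Z-order prefix method and does not use Hadoop or MapReduce. The function names, the raw term-frequency weighting, and the `n_important` cutoff are all assumptions made for this example, and the overlap filter shown here is a heuristic that may discard some genuinely similar pairs.

```python
import heapq
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector for a lowercased, whitespace-tokenized document."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def important_terms(vec, n):
    """The n highest-weighted terms of a vector (ties broken alphabetically)."""
    ranked = sorted(vec.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:n]}


def top_k_similar(query, corpus, k=2, n_important=3):
    """Return up to k (score, doc_id) pairs, most similar first.

    Hypothetical filter: a document is compared in full only if its
    important terms overlap with the query's important terms.
    """
    q = tf_vector(query)
    q_terms = important_terms(q, n_important)
    scored = []
    for doc_id, text in corpus.items():
        d = tf_vector(text)
        if not (q_terms & important_terms(d, n_important)):
            continue  # filtered out: no full cosine computation needed
        scored.append((cosine(q, d), doc_id))
    return heapq.nlargest(k, scored)


corpus = {
    "a": "hadoop mapreduce hadoop cluster",
    "b": "cooking pasta recipe",
    "c": "hadoop hadoop mapreduce",
}
results = top_k_similar("hadoop mapreduce hadoop", corpus, k=2)
```

In this toy corpus, document `b` shares no important term with the query and is skipped before any cosine computation, which is the kind of saving in pairwise comparisons that the abstract describes.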

Item Type:	Article
Uncontrolled Keywords:	Z-order; Document similarity; MapReduce; Hadoop; Cosine similarity
Subjects:	Q Science > QA Mathematics > QA075 Electronic computers. Computer science; Q Science > QA Mathematics > QA076 Computer software
Divisions:	Faculty of Engineering and Natural Sciences > Academic programs > Computer Science & Eng.; Faculty of Engineering and Natural Sciences
Depositing User:	Erkay Savaş
Date Deposited:	11 Nov 2016 22:41
Last Modified:	26 Apr 2022 09:37
URI:	https://research.sabanciuniv.edu/id/eprint/30189

Available Versions of this Item

Efficient top-k similarity document search utilizing distributed file systems and cosine similarity. (deposited 22 Dec 2015 15:07)
- Efficient top-k similarity document search utilizing distributed file systems and cosine similarity. (deposited 11 Nov 2016 22:41) [Currently Displayed]
