1. Project Overview

Given a similarity function and two sets of objects, a similarity join returns all pairs of objects (from each set respectively) such that their similarity value satisifies a given criterion. A typical example is to find pairs of documents such that their cosine similarity is above a constant threshold (say, 0.95), as they are likely to be near duplicate documents.

In this project, we focus on efficient algorithms to perform the similarity join exactly on (multi-) sets or strings. Commonly used similarity functions for (multi-) sets or strings are more complex than Euclidean distance functions. As a result, many previous approaches have to compute the approximate results instead.

We have developed several fast algorithms that address the above performance issue. They work for Jaccard, Dice, Cosine similarities and Edit distance constraints. We have also devised algorithms to compute top-k similar pairs progressively so that a user does not need to specify a similarity threshold for some unknown dataset. Our latest work can extract approximate mentions of entities in a document given an entity dictionary.

This project is partly funded by ARC Discovery Projects DP0987273 and DP0881779.

2. People

Academic Staff:

PhD Student:

3. Publications

  1. Haichuan Shang, Xuemin Lin, Wei Wang, Jeffrey Xu Yu, Ying Zhang. NGsearch: A Novel Indexing and Query Processing Method for Subgraph Similarity Search. Submitted for publication.

  2. Wei Wang, Chuan Xiao, Xuemin Lin, Chengqi Zhang. Efficient Approximate Entity Extraction with Edit Distance Constraints. SIGMOD 2009.

    • PDF

    • Slides

    • Summary: This work presents an efficient algorithm to extract approximate mentions of entities from incoming documents. E.g., we can match Ronald Regan to Ronald Reagan in the Reuters news. The algorithm is also interesting in its own right as it generalizes the neighborhood-generation-base approach for approximate string matching.

  3. Wei Wang. Efficient Exact Similarity Join Algorithms. Seminar at University of Technology, Sydney, Oct 2009.

  4. Chuan Xiao, Wei Wang, Xuemin Lin, Haichuan Shang. Top-k Set Similarity Joins. ICDE 2009.

    • PDF

    • Slides

    • Summary: Not sure about which similarity threshold to use when performing similarity join? Then this top-k similarity join algorithm is for you; it computes the k pairs of records that have the highest similarity value in a progressive manner.

  5. Wei Wang. Similarity Join Algorithms: An Introduction. Tutorial at SEBD 2008.

    • Slides (ppt slides available upon request).

  6. Chuan Xiao, Wei Wang, Xuemin Lin. Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. VLDB 2008.

    • PDF

    • Slides

    • Summary: An efficient algorithm, Ed-Join for similarity joins with edit distance constraints. Ed-Join is capable of computing edit similarity join for half a million protein sequences with edit distance threshold up to 20 within a minute.

  7. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.

    • PDF

    • Slides

    • Summary: Efficient similarity join algorithms (ppjoin and ppjoin+) for several commonly used similarity functions (including Jaccard, cosine, overlap, edit distance). (Ed-Join is recommended for similarity join with edit distance constraint though)

4. Download