1. Project Overview

Given a similarity function and two sets of objects, a similarity join returns all pairs of objects (from each set respectively) such that their similarity value satisifies a given criterion. A typical example is to find pairs of documents such that their cosine similarity is above a constant threshold (say, 0.95), as they are likely to be near duplicate documents.

In this project, we focus on efficient algorithms to perform the similarity join exactly on (multi-) sets or strings. Commonly used similarity functions for (multi-) sets or strings are more complex than Euclidean distance functions. As a result, many previous approaches have to compute the approximate results instead.

We have developed several fast algorithms that address the above performance issue. They work for Jaccard, Dice, Cosine similarities and Edit distance constraints. We have also devised algorithms to compute top-k similar pairs progressively so that a user does not need to specify a similarity threshold for some unknown dataset. Our latest work can extract approximate mentions of entities in a document given an entity dictionary.

This project is partly funded by ARC Discovery Projects DP0987273 and DP0881779.

2. People

Academic Staff:

PhD Student:

3. Publications

  1. Chaokun Wang, Jianmin Wang, Xuemin Lin, Wei Wang, Haixun Wang, Hongsong Li, Wanpeng Tian, Jun Xu, Rui Li. MapDupReducer: Detecting Near Duplicates over Massive Datasets. SIGMOD 2010.

    • PDF

    • PPT

    • Summary: This work essentially ports the ppjoin+ algorithm to the Map-Reduce framework in order to deal with huge volume of data.

  2. Haichuan Shang, Xuemin Lin, Ying Zhang, Jeffrey Xu Yu, Wei Wang. Connected Substructure Similarity Search. SIGMOD 2010.

    • PDF

    • PPT

    • Summary: This work defines a new (assymmetric) similarity measure between graphs and proposes a new indexing and query processing method to perform efficient subgraph similarity search based on this new measure.

  3. Wei Wang, Jianbin Qin, Chuan Xiao, Xuemin Lin, Heng Tao Shen. VChunkJoin: An Efficient Algorithm for Edit Similarity Joins. Submitted for publication.

    • n/a

    • n/a

    • Summary: This work proposes a novel method to efficiently process edit similarity join. Compared with existing methods (e.g., ed-join, vgram-join), it uses less space yet it is faster in many parameter settings.

  4. Wei Wang, Chuan Xiao, Xuemin Lin, Chengqi Zhang. Efficient Approximate Entity Extraction with Edit Distance Constraints. SIGMOD 2009.

    • PDF

    • Slides

    • Summary: This work presents an efficient algorithm to extract approximate mentions of entities from incoming documents. E.g., we can match Ronald Regan to Ronald Reagan in the Reuters news. The algorithm is also interesting in its own right as it generalizes the neighborhood-generation-base approach for approximate string matching.

  5. Wei Wang. Efficient Exact Similarity Join Algorithms. Seminar at University of Technology, Sydney, Oct 2009.

  6. Chuan Xiao, Wei Wang, Xuemin Lin, Haichuan Shang. Top-k Set Similarity Joins. ICDE 2009.

    • PDF

    • Slides

    • Summary: Not sure about which similarity threshold to use when performing similarity join? Then this top-k similarity join algorithm is for you; it computes the k pairs of records that have the highest similarity value in a progressive manner.

  7. Wei Wang. Similarity Join Algorithms: An Introduction. Tutorial at SEBD 2008.

    • Slides (ppt slides available upon request).

  8. Chuan Xiao, Wei Wang, Xuemin Lin. Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. VLDB 2008.

    • PDF

    • Slides

    • Summary: An efficient algorithm, Ed-Join for similarity joins with edit distance constraints. Ed-Join is capable of computing edit similarity join for half a million protein sequences with edit distance threshold up to 20 within a minute.

  9. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.

    • PDF

    • Slides

    • Summary: Efficient similarity join algorithms (ppjoin and ppjoin+) for several commonly used similarity functions (including Jaccard, cosine, overlap, edit distance). (Ed-Join is recommended for similarity join with edit distance constraint though)

4. Download