Given a similarity function and two sets of objects, a similarity join returns all pairs of objects (from each set respectively) such that their similarity value satisifies a given criterion. A typical example is to find pairs of documents such that their cosine similarity is above a constant threshold (say, 0.95), as they are likely to be near duplicate documents.
In this project, we focus on efficient algorithms to perform the similarity join exactly on (multi-) sets or strings. Commonly used similarity functions for (multi-) sets or strings are more complex than Euclidean distance functions. As a result, many previous approaches have to compute the approximate results instead.
We have developed several fast algorithms that address the above performance issue. They work for Jaccard, Dice, Cosine similarities and Edit distance constraints. We have also devised algorithms to compute top-k similar pairs progressively so that a user does not need to specify a similarity threshold for some unknown dataset. Our latest work can extract approximate mentions of entities in a document given an entity dictionary.
This project is partly funded by ARC Discovery Projects DP0987273 and DP0881779.
Haichuan Shang, Xuemin Lin, Wei Wang, Jeffrey Xu Yu, Ying Zhang. NGsearch: A Novel Indexing and Query Processing Method for Subgraph Similarity Search. Submitted for publication.
Wei Wang, Chuan Xiao, Xuemin Lin, Chengqi Zhang. Efficient Approximate Entity Extraction with Edit Distance Constraints. SIGMOD 2009.
Summary: This work presents an efficient algorithm to extract approximate mentions of entities from incoming documents. E.g., we can match Ronald Regan to Ronald Reagan in the Reuters news. The algorithm is also interesting in its own right as it generalizes the neighborhood-generation-base approach for approximate string matching.
Wei Wang. Efficient Exact Similarity Join Algorithms. Seminar at University of Technology, Sydney, Oct 2009.
Chuan Xiao, Wei Wang, Xuemin Lin, Haichuan Shang. Top-k Set Similarity Joins. ICDE 2009.
Wei Wang. Similarity Join Algorithms: An Introduction. Tutorial at SEBD 2008.
Slides (ppt slides available upon request).
Chuan Xiao, Wei Wang, Xuemin Lin. Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. VLDB 2008.
Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.