The NIKE Group


>> Overview    




Data Linkage


In the (Data) Linkage project, a sequel of the Quagga data cleaning project, we re-visit the extensively-studied traditional (record) linkage problem of identifying matching entities in a collection, and argue that novel solutions be needed to cope with novel challenges. For instance, we pursue the following novel directions:

  • Web based linkage: When one cannot determine well if two entities are matching or not due to the lack of evidences, we propose to use the Web as the ultimate knowledge source.

  • Parallel linkage: By turning existing data linkage solutions to parallel programs, one can achieve substantial speed-up. However, due to intricate interplay of match vs. merge operations in the data linkage, the parallelization requires a careful design.

  • Group linkage: When entities to match are no longer simple records but have non-trivial internal properties or structures (such as groups), its exploitation can bring a significant improvement to the data linkage.

  • Adaptive linkage: Often, in existing data linkage solutions, various parameters need to be set once (by users) and do not change during the execution. However, by adaptively changing the values to maximize objective functions, one can substantially reduce false negatives.

$ Last generated: Fri Jul 6 00:45:37 2007 EST $