Record linkage as DNA sequence alignment problem

Yoojin Hong, Tao Yang, Jaewoo Kang, Dongwon Lee

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

4 Citations (Scopus)

Abstract

Modern database applications increasingly need to deal with dirty data due to a variety of reasons (e.g., data entry errors, heterogeneous formats, and ambiguous terms), which gives rise to the (record) linkage problem: determining whether two entities represented as relational records are approximately the same. In this paper, we propose a novel idea of using BLAST, the popular gene sequence alignment algorithm in biology. Our proposal, termed the BLASTed linkage, is based on the observations that: (1) both problems are variants of approximate pattern matching, (2) BLAST provides a statistical guarantee of search results in a scalable manner, a feature greatly lacking in many linkage solutions, and (3) by transforming the record linkage problem into the gene sequence alignment problem, one can leverage a wealth of advanced algorithms, implementations, and tools that have been actively developed for BLAST during the last decade. In translating letters of the English alphabet to DNA sequences of A, C, G, and T, we study four variations: (1) default: each letter is mapped to nucleotides under 1, 2, and 1-bit coding schemes, (2) weighted: tokens are elongated or shortened in proportion to their importance, making important tokens longer in the resultant DNA sequences, (3) hybrid: each token's lexical meaning as well as its importance are considered at the same time during translation, and (4) multi-bit: tokens are selected for any of 1, 2, and 1-bit coding schemes based on the cumulative distribution functions of their [...]. The schemes are experimentally validated using both real and synthetic data sets.

Original language: English
Title of host publication: CTIT workshop proceedings series
Pages: 13-22
Number of pages: 10
Volume: WP 08
Edition: 02
Publication status: Published - 2008
Event: 6th International Workshop on Quality in Databases, QDB 2008 and 3rd Workshop on Management of Uncertain Data, MUD 2008 - Auckland, New Zealand
Duration: 2008 Aug 1 to 2008 Aug 1

Other

Other: 6th International Workshop on Quality in Databases, QDB 2008 and 3rd Workshop on Management of Uncertain Data, MUD 2008
Country: New Zealand
City: Auckland
Period: 08/8/1 to 08/8/1

Fingerprint

DNA sequences
Genes
DNA
coding
gene
Pattern matching
Nucleotides
Distribution functions
Data acquisition
biology
guarantee
alignment
distribution

ASJC Scopus subject areas

  • Information Systems
  • Geography, Planning and Development

Cite this

Hong, Y., Yang, T., Kang, J., & Lee, D. (2008). Record linkage as DNA sequence alignment problem. In CTIT workshop proceedings series (02 ed., Vol. WP 08, pp. 13-22).

Record linkage as DNA sequence alignment problem. / Hong, Yoojin; Yang, Tao; Kang, Jaewoo; Lee, Dongwon.

CTIT workshop proceedings series. Vol. WP 08 02. ed. 2008. p. 13-22.

Hong, Y, Yang, T, Kang, J & Lee, D 2008, Record linkage as DNA sequence alignment problem. in CTIT workshop proceedings series. 02 edn, vol. WP 08, pp. 13-22, 6th International Workshop on Quality in Databases, QDB 2008 and 3rd Workshop on Management of Uncertain Data, MUD 2008, Auckland, New Zealand, 08/8/1.
Hong Y, Yang T, Kang J, Lee D. Record linkage as DNA sequence alignment problem. In CTIT workshop proceedings series. 02 ed. Vol. WP 08. 2008. p. 13-22
Hong, Yoojin ; Yang, Tao ; Kang, Jaewoo ; Lee, Dongwon. / Record linkage as DNA sequence alignment problem. CTIT workshop proceedings series. Vol. WP 08 02. ed. 2008. pp. 13-22
@inproceedings{e00cd784f16d48acbde05ad84f3230a6,
title = "Record linkage as {DNA} sequence alignment problem",
abstract = "Modern database applications increasingly need to deal with dirty data due to a variety of reasons (e.g., data entry errors, heterogeneous formats, and ambiguous terms), which gives rise to the (record) linkage problem: determining whether two entities represented as relational records are approximately the same. In this paper, we propose a novel idea of using BLAST, the popular gene sequence alignment algorithm in biology. Our proposal, termed the BLASTed linkage, is based on the observations that: (1) both problems are variants of approximate pattern matching, (2) BLAST provides a statistical guarantee of search results in a scalable manner, a feature greatly lacking in many linkage solutions, and (3) by transforming the record linkage problem into the gene sequence alignment problem, one can leverage a wealth of advanced algorithms, implementations, and tools that have been actively developed for BLAST during the last decade. In translating letters of the English alphabet to DNA sequences of A, C, G, and T, we study four variations: (1) default: each letter is mapped to nucleotides under 1, 2, and 1-bit coding schemes, (2) weighted: tokens are elongated or shortened in proportion to their importance, making important tokens longer in the resultant DNA sequences, (3) hybrid: each token's lexical meaning as well as its importance are considered at the same time during translation, and (4) multi-bit: tokens are selected for any of 1, 2, and 1-bit coding schemes based on the cumulative distribution functions of their [...]. The schemes are experimentally validated using both real and synthetic data sets.",
author = "Yoojin Hong and Tao Yang and Jaewoo Kang and Dongwon Lee",
year = "2008",
language = "English",
volume = "WP 08",
pages = "13--22",
booktitle = "CTIT workshop proceedings series",
edition = "02",

}

TY - GEN

T1 - Record linkage as DNA sequence alignment problem

AU - Hong, Yoojin

AU - Yang, Tao

AU - Kang, Jaewoo

AU - Lee, Dongwon

PY - 2008

Y1 - 2008

N2 - Modern database applications increasingly need to deal with dirty data due to a variety of reasons (e.g., data entry errors, heterogeneous formats, and ambiguous terms), which gives rise to the (record) linkage problem: determining whether two entities represented as relational records are approximately the same. In this paper, we propose a novel idea of using BLAST, the popular gene sequence alignment algorithm in biology. Our proposal, termed the BLASTed linkage, is based on the observations that: (1) both problems are variants of approximate pattern matching, (2) BLAST provides a statistical guarantee of search results in a scalable manner, a feature greatly lacking in many linkage solutions, and (3) by transforming the record linkage problem into the gene sequence alignment problem, one can leverage a wealth of advanced algorithms, implementations, and tools that have been actively developed for BLAST during the last decade. In translating letters of the English alphabet to DNA sequences of A, C, G, and T, we study four variations: (1) default: each letter is mapped to nucleotides under 1, 2, and 1-bit coding schemes, (2) weighted: tokens are elongated or shortened in proportion to their importance, making important tokens longer in the resultant DNA sequences, (3) hybrid: each token's lexical meaning as well as its importance are considered at the same time during translation, and (4) multi-bit: tokens are selected for any of 1, 2, and 1-bit coding schemes based on the cumulative distribution functions of their [...]. The schemes are experimentally validated using both real and synthetic data sets.

UR - http://www.scopus.com/inward/record.url?scp=84882947414&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84882947414&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84882947414

VL - WP 08

SP - 13

EP - 22

BT - CTIT workshop proceedings series

ER -