On Schema Matching with Opaque Column Names and Data Values

Jaewoo Kang, Jeffrey F. Naughton

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

145 Citations (Scopus)

Abstract

Most previous solutions to the schema matching problem rely in some fashion upon identifying "similar" column names in the schemas to be matched, or upon recognizing common domains in the data stored in the schemas. While each of these approaches is valuable in many cases, they are not infallible, and there exist instances of the schema matching problem for which they do not even apply. Such problem instances typically arise when the column names in the schemas and the data in the columns are "opaque" or very difficult to interpret. In this paper we propose a two-step technique that works even in the presence of opaque column names and data values. In the first step, we measure the pair-wise attribute correlations in the tables to be matched and construct a dependency graph using mutual information as a measure of the dependency between attributes. In the second step, we find matching node pairs in the dependency graphs by running a graph matching algorithm. We validate our approach with an experimental study, the results of which suggest that such an approach can be a useful addition to a set of (semi-)automatic schema matching techniques.
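The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names neither the distance metric nor the graph matching algorithm, so the squared-difference cost and the exhaustive permutation search below are assumptions chosen for brevity, and all function and variable names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two equal-length columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written in terms of raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def dependency_graph(columns):
    """Step 1: weighted graph as a matrix; entry (i, j) is MI(column i, column j).
    The diagonal holds MI(X, X), which equals the entropy of the column."""
    k = len(columns)
    return [[mutual_information(columns[i], columns[j]) for j in range(k)]
            for i in range(k)]


def match_graphs(g1, g2):
    """Step 2 (assumed variant): try every one-to-one node mapping and keep the
    one with the smallest sum of squared weight differences. Feasible only for
    small schemas, since the search is factorial in the number of columns."""
    k = len(g1)
    best, best_cost = None, float("inf")
    for perm in permutations(range(k)):
        cost = sum((g1[i][j] - g2[perm[i]][perm[j]]) ** 2
                   for i in range(k) for j in range(k))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost


# Table A: three columns with interpretable values.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 2, 2, 3, 3]
c = [0, 0, 0, 0, 0, 0, 1, 1]
table_a = [a, b, c]

# Table B: the same data with columns reordered (c, a, b) and values recoded
# into opaque tokens, so neither names nor values can be compared directly.
table_b = [
    ["x" if v == 0 else "y" for v in c],
    ["p" if v == 0 else "q" for v in a],
    ["k%d" % (7 - v) for v in b],
]

perm, cost = match_graphs(dependency_graph(table_a), dependency_graph(table_b))
print(perm, cost)  # perm[i] is the column of table B matched to column i of table A
```

Because mutual information depends only on the joint distribution of values and not on the values themselves, recoding a column one-to-one leaves the graph weights unchanged, which is why the structure alone can recover the mapping in this toy case.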

Original language: English
Title of host publication: Proceedings of the ACM SIGMOD International Conference on Management of Data
Editors: A.Y. Halevy, Z.G. Ives, A.H. Doan
Pages: 205-216
Number of pages: 12
Publication status: Published - 2003
Externally published: Yes
Event: 2003 ACM SIGMOD International Conference on Management of Data - San Diego, CA, United States
Duration: 2003 Jun 9 to 2003 Jun 12

Other

Other: 2003 ACM SIGMOD International Conference on Management of Data
Country: United States
City: San Diego, CA
Period: 03/6/9 to 03/6/12

ASJC Scopus subject areas

  • Computer Science(all)

Cite this

Kang, J., & Naughton, J. F. (2003). On Schema Matching with Opaque Column Names and Data Values. In A. Y. Halevy, Z. G. Ives, & A. H. Doan (Eds.), Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 205-216)

On Schema Matching with Opaque Column Names and Data Values. / Kang, Jaewoo; Naughton, Jeffrey F.

Proceedings of the ACM SIGMOD International Conference on Management of Data. ed. / A.Y. Halevy; Z.G. Ives; A.H. Doan. 2003. p. 205-216.

Kang, J & Naughton, JF 2003, On Schema Matching with Opaque Column Names and Data Values. in AY Halevy, ZG Ives & AH Doan (eds), Proceedings of the ACM SIGMOD International Conference on Management of Data. pp. 205-216, 2003 ACM SIGMOD International Conference on Management of Data, San Diego, CA, United States, 03/6/9.
Kang J, Naughton JF. On Schema Matching with Opaque Column Names and Data Values. In Halevy AY, Ives ZG, Doan AH, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data. 2003. p. 205-216
Kang, Jaewoo ; Naughton, Jeffrey F. / On Schema Matching with Opaque Column Names and Data Values. Proceedings of the ACM SIGMOD International Conference on Management of Data. editor / A.Y. Halevy ; Z.G. Ives ; A.H. Doan. 2003. pp. 205-216
@inproceedings{39fc67b8aabc49678c27e170baab9ff3,
title = "On Schema Matching with Opaque Column Names and Data Values",
abstract = "Most previous solutions to the schema matching problem rely in some fashion upon identifying {"}similar{"} column names in the schemas to be matched, or by recognizing common domains in the data stored in the schemas. While each of these approaches is valuable in many cases, they are not infallible, and there exist instances of the schema matching problem for which they do not even apply. Such problem instances typically arise when the column names in the schemas and the data in the columns are {"}opaque{"} or very difficult to interpret. In this paper we propose a two-step technique that works even in the presence of opaque column names and data values. In the first step, we measure the pair-wise attribute correlations in the tables to be matched and construct a dependency graph using mutual information as a measure of the dependency between attributes. In the second stage, we find matching node pairs in the dependency graphs by running a graph matching algorithm. We validate our approach with an experimental study, the results of which suggest that such an approach can be a useful addition to a set of (semi) automatic schema matching techniques.",
author = "Jaewoo Kang and Naughton, {Jeffrey F.}",
year = "2003",
language = "English",
pages = "205--216",
editor = "A.Y. Halevy and Z.G. Ives and A.H. Doan",
booktitle = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

}

TY - GEN

T1 - On Schema Matching with Opaque Column Names and Data Values

AU - Kang, Jaewoo

AU - Naughton, Jeffrey F.

PY - 2003

Y1 - 2003

AB - Most previous solutions to the schema matching problem rely in some fashion upon identifying "similar" column names in the schemas to be matched, or by recognizing common domains in the data stored in the schemas. While each of these approaches is valuable in many cases, they are not infallible, and there exist instances of the schema matching problem for which they do not even apply. Such problem instances typically arise when the column names in the schemas and the data in the columns are "opaque" or very difficult to interpret. In this paper we propose a two-step technique that works even in the presence of opaque column names and data values. In the first step, we measure the pair-wise attribute correlations in the tables to be matched and construct a dependency graph using mutual information as a measure of the dependency between attributes. In the second stage, we find matching node pairs in the dependency graphs by running a graph matching algorithm. We validate our approach with an experimental study, the results of which suggest that such an approach can be a useful addition to a set of (semi) automatic schema matching techniques.

UR - http://www.scopus.com/inward/record.url?scp=1142267348&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=1142267348&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:1142267348

SP - 205

EP - 216

BT - Proceedings of the ACM SIGMOD International Conference on Management of Data

A2 - Halevy, A.Y.

A2 - Ives, Z.G.

A2 - Doan, A.H.

ER -