In Korean information retrieval, syntactic term mismatches between index terms and query terms have been a serious obstacle to improving retrieval performance. Conventional approaches try to alleviate these mismatches either by segmenting compound nouns or by normalizing different representations of noun phrases. However, segmentation alone may inflate similarity measurements unnecessarily, since the segmented unit nouns cannot discriminate between different formations of compound nouns. Normalization alone, on the other hand, is limited in how far it can alleviate syntactic term mismatches, because normalized phrases are highly specific. In this paper, we propose a Korean information retrieval system that alleviates syntactic term mismatches by both segmenting compound nouns and normalizing noun phrases, and that provides appropriate similarity measurements. In the indexing module, we segment compound nouns using statistical information and normalize noun phrases using dependency relations. We then extract terms annotated with boundary information. Finally, the terms are weighted by a newly devised weighting scheme suited to Korean noun phrases. In the retrieval module, we compute similarity with partial matching taken into account, using the boundary information. The experimental results show that the proposed method alleviates syntactic term mismatches and improves precision without decreasing recall.
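The pipeline summarized above (segment compound nouns, attach boundary information, then score partial matches) can be illustrated with a minimal Python sketch. The abstract does not specify the statistics, the boundary encoding, or the weighting scheme, so everything here is an assumption made for illustration: the `UNIT_FREQ` table and its values, the `(unit, position, compound length)` encoding, the `PARTIAL` discount, and the function names `segment`, `index_terms`, and `similarity`. Dependency-based normalization of noun phrases is not modeled.

```python
from math import log

# Hypothetical unit-noun frequencies (toy values, not taken from the paper).
UNIT_FREQ = {"정보": 50, "검색": 40, "시스템": 30}
TOTAL = sum(UNIT_FREQ.values())

# Assumed discount for partial matches; the paper's weighting scheme is not given.
PARTIAL = 0.5


def segment(compound):
    """Split a compound noun into known unit nouns.

    Dynamic programming over split points, choosing the segmentation with
    the highest summed log relative frequency. Falls back to the whole
    string when no full segmentation into known units exists.
    """
    n = len(compound)
    best = [None] * (n + 1)  # best[i] = (score, units) for compound[:i]
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(i):
            piece = compound[j:i]
            if best[j] is not None and piece in UNIT_FREQ:
                score = best[j][0] + log(UNIT_FREQ[piece] / TOTAL)
                if best[i] is None or score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
    return best[n][1] if best[n] is not None else [compound]


def index_terms(compound):
    """Attach boundary information: (unit, position, number of units)."""
    units = segment(compound)
    return [(unit, pos, len(units)) for pos, unit in enumerate(units)]


def similarity(query, doc_term):
    """Full credit only when units and boundaries agree; otherwise a
    discounted score proportional to the share of overlapping unit nouns."""
    q, d = index_terms(query), index_terms(doc_term)
    if q == d:
        return 1.0
    shared = {unit for unit, _, _ in q} & {unit for unit, _, _ in d}
    return PARTIAL * len(shared) / max(len(q), len(d))


if __name__ == "__main__":
    print(segment("정보검색시스템"))
    print(similarity("정보검색", "정보검색"))
    print(similarity("정보검색", "검색정보"))
```

Under these assumptions, "정보검색" and "검색정보" share the same unit nouns but differ in formation, so the boundary check keeps them from receiving the full score that matching on segmented unit nouns alone would give them.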
ASJC Scopus subject areas
- Computer Science Applications
- Information Systems
- Library and Information Sciences