A data-driven text similarity measure based on classification algorithms

Su Gon Cho, Seoung Bum Kim

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

Measuring text similarity is fundamental to many text mining applications. This paper proposes a new method based on classification algorithms for measuring the similarity between two texts. Specifically, a sentence-term matrix, which records the frequency of terms occurring in a collection of sentences, is constructed and used to measure the classification accuracy between two texts. The idea rests on the observation that similar texts are difficult to distinguish from each other, which should lead to a low classification accuracy between similar texts. In comparative experiments against several widely used text similarity measures, results on real data from the Machine Learning Repository at the University of California, Irvine show that the proposed method outperformed the existing similarity measures across the entire range of term selection filters.
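
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the classifier, the validation scheme, or the term selection filters, so the multinomial naive Bayes model, the leave-one-out evaluation, the naive sentence splitter, and every function name below are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption for this sketch)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def sentence_term_matrix(sentences):
    """Return (vocab, rows), where rows[i][j] counts term j in sentence i."""
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: j for j, t in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(toks).items():
            row[index[term]] = count
        rows.append(row)
    return vocab, rows


def train_nb(rows, labels, alpha=1.0):
    """Fit multinomial naive Bayes with Laplace smoothing (assumed classifier)."""
    n_terms = len(rows[0])
    model = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        totals = [sum(col) for col in zip(*members)]
        denom = sum(totals) + alpha * n_terms
        log_prior = math.log(len(members) / len(rows))
        log_probs = [math.log((t + alpha) / denom) for t in totals]
        model[label] = (log_prior, log_probs)
    return model


def predict_nb(model, row):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_probs = model[label]
        return log_prior + sum(c * lp for c, lp in zip(row, log_probs))
    return max(sorted(model), key=score)


def classification_accuracy(text_a, text_b):
    """Leave-one-out accuracy of telling sentences of text_a from text_b."""
    sents_a, sents_b = split_sentences(text_a), split_sentences(text_b)
    _, rows = sentence_term_matrix(sents_a + sents_b)
    labels = [0] * len(sents_a) + [1] * len(sents_b)
    correct = 0
    for i in range(len(rows)):
        # Hold out sentence i, train on the rest, then classify it.
        model = train_nb(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:])
        correct += predict_nb(model, rows[i]) == labels[i]
    return correct / len(rows)


if __name__ == "__main__":
    cats = "The cat sat on the mat. The cat chased a mouse. A cat likes warm milk."
    markets = "Stocks fell sharply today. Investors sold bank shares. Markets closed lower again."
    print(classification_accuracy(cats, markets))  # dissimilar texts
    print(classification_accuracy(cats, cats))     # identical texts
```

Under these assumptions, a low accuracy signals that the two texts are hard to tell apart and therefore similar. One possible way to turn this into a similarity score is `1 - accuracy`, but that mapping is likewise an assumption and is not stated in the abstract.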

Original language: English
Pages (from-to): 328-339
Number of pages: 12
Journal: International Journal of Industrial Engineering : Theory Applications and Practice
Volume: 24
Issue number: 3
Publication status: Published - 2017

Keywords

  • Classification
  • Sentence-term matrix
  • Text mining
  • Text similarity measure

ASJC Scopus subject areas

  • Industrial and Manufacturing Engineering

Cite this

@article{f04910d936694a2e8987580d361c0d29,
title = "A data-driven text similarity measure based on classification algorithms",
abstract = "Measuring text similarity is fundamental to many text mining applications. This paper proposes a new method based on classification algorithms for measuring the similarity between two texts. Specifically, a sentence-term matrix, which records the frequency of terms occurring in a collection of sentences, is constructed and used to measure the classification accuracy between two texts. The idea rests on the observation that similar texts are difficult to distinguish from each other, which should lead to a low classification accuracy between similar texts. In comparative experiments against several widely used text similarity measures, results on real data from the Machine Learning Repository at the University of California, Irvine show that the proposed method outperformed the existing similarity measures across the entire range of term selection filters.",
keywords = "Classification, Sentence-term matrix, Text mining, Text similarity measure",
author = "Cho, {Su Gon} and Kim, {Seoung Bum}",
year = "2017",
language = "English",
volume = "24",
pages = "328--339",
journal = "International Journal of Industrial Engineering : Theory Applications and Practice",
issn = "1072-4761",
publisher = "University of Cincinnati",
number = "3",

}

TY - JOUR

T1 - A data-driven text similarity measure based on classification algorithms

AU - Cho, Su Gon

AU - Kim, Seoung Bum

PY - 2017

Y1 - 2017

N2 - Measuring text similarity is fundamental to many text mining applications. This paper proposes a new method based on classification algorithms for measuring the similarity between two texts. Specifically, a sentence-term matrix, which records the frequency of terms occurring in a collection of sentences, is constructed and used to measure the classification accuracy between two texts. The idea rests on the observation that similar texts are difficult to distinguish from each other, which should lead to a low classification accuracy between similar texts. In comparative experiments against several widely used text similarity measures, results on real data from the Machine Learning Repository at the University of California, Irvine show that the proposed method outperformed the existing similarity measures across the entire range of term selection filters.

KW - Classification

KW - Sentence-term matrix

KW - Text mining

KW - Text similarity measure

UR - http://www.scopus.com/inward/record.url?scp=85032832225&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85032832225&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:85032832225

VL - 24

SP - 328

EP - 339

JO - International Journal of Industrial Engineering : Theory Applications and Practice

JF - International Journal of Industrial Engineering : Theory Applications and Practice

SN - 1072-4761

IS - 3

ER -