Document clustering method using dimension reduction and support vector clustering to overcome sparseness

Sunghae Jun, Sang Sung Park, Dong Sik Jang

Research output: Contribution to journal › Article

59 Citations (Scopus)

Abstract

Many studies on developing technologies have been published as articles, papers, or patents. We use and analyze these documents to find scientific and technological trends. In this paper, we consider document clustering as a method of document data analysis. In general, we have trouble analyzing documents directly because document data are not suitable for statistical and machine learning methods of analysis. Therefore, we have to transform document data into structured data for analytical purposes. For this process, we use text mining techniques. The structured data are very sparse, and hence, it is difficult to analyze them. This study proposes a new method to overcome the sparsity problem of document clustering. We build a combined clustering method using dimension reduction and K-means clustering based on support vector clustering and Silhouette measure. In particular, we attempt to overcome the sparseness in patent document clustering. To verify the efficacy of our work, we first conduct an experiment using news data from the machine learning repository of the University of California at Irvine. Second, using patent documents retrieved from the United States Patent and Trademark Office, we carry out patent clustering for technology forecasting.
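
To illustrate two ideas the abstract names, sparsity of the structured document data and the Silhouette measure, here is a minimal Python sketch. It is not the authors' method: it performs no dimension reduction and no support vector clustering. The four toy documents and the helper names (`document_term_matrix`, `sparsity`, `mean_silhouette`) are assumptions made up for this example, and Euclidean distance on raw term counts is only one possible choice.

```python
from collections import Counter
from math import sqrt


def document_term_matrix(docs):
    """Return (vocabulary, rows), one row of raw term counts per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # a Counter yields 0 for terms that are absent
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def sparsity(matrix):
    """Fraction of entries in the matrix that are zero."""
    cells = [value for row in matrix for value in row]
    return sum(1 for value in cells if value == 0) / len(cells)


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Average silhouette coefficient; needs at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        # a: mean distance to the other members of the same cluster
        same = [euclidean(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        if not same:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster
        b = None
        for other in set(labels) - {labels[i]}:
            dists = [euclidean(p, q) for j, q in enumerate(points)
                     if labels[j] == other]
            mean_dist = sum(dists) / len(dists)
            b = mean_dist if b is None else min(b, mean_dist)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy corpus: two patent-like and two news-like documents.
docs = [
    "patent claims battery electrode",
    "battery electrode coating patent",
    "news election vote results",
    "election vote news coverage",
]
vocab, matrix = document_term_matrix(docs)
print(len(vocab), round(sparsity(matrix), 2))  # 10 0.6

good = mean_silhouette(matrix, [0, 0, 1, 1])  # topics kept together
bad = mean_silhouette(matrix, [0, 1, 0, 1])   # topics mixed
print(round(good, 2), round(bad, 2))  # 0.5 -0.25
```

Even with four short documents, 60% of the matrix entries are zero, and the share grows quickly as the vocabulary expands. The labeling that keeps the topics together gets the higher silhouette value, which is the kind of signal a Silhouette-based procedure can use to compare candidate clusterings.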

Original language: English
Pages (from-to): 3204-3212
Number of pages: 9
Journal: Expert Systems with Applications
Volume: 41
Issue number: 7
DOI: 10.1016/j.eswa.2013.11.018
Publication status: Published - 2014 Jun 1

Keywords

  • Dimension reduction
  • Document clustering
  • K-means clustering based on support vector clustering
  • Patent clustering
  • Silhouette measure
  • Sparseness problem

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Engineering (all)

Cite this

Document clustering method using dimension reduction and support vector clustering to overcome sparseness. / Jun, Sunghae; Park, Sang Sung; Jang, Dong Sik.

In: Expert Systems with Applications, Vol. 41, No. 7, 01.06.2014, p. 3204-3212.

@article{dc4433eb5b084d77a80dab15b7d0fe30,
title = "Document clustering method using dimension reduction and support vector clustering to overcome sparseness",
abstract = "Many studies on developing technologies have been published as articles, papers, or patents. We use and analyze these documents to find scientific and technological trends. In this paper, we consider document clustering as a method of document data analysis. In general, we have trouble analyzing documents directly because document data are not suitable for statistical and machine learning methods of analysis. Therefore, we have to transform document data into structured data for analytical purposes. For this process, we use text mining techniques. The structured data are very sparse, and hence, it is difficult to analyze them. This study proposes a new method to overcome the sparsity problem of document clustering. We build a combined clustering method using dimension reduction and K-means clustering based on support vector clustering and Silhouette measure. In particular, we attempt to overcome the sparseness in patent document clustering. To verify the efficacy of our work, we first conduct an experiment using news data from the machine learning repository of the University of California at Irvine. Second, using patent documents retrieved from the United States Patent and Trademark Office, we carry out patent clustering for technology forecasting.",
keywords = "Dimension reduction, Document clustering, K-means clustering based on support vector clustering, Patent clustering, Silhouette measure, Sparseness problem",
author = "Jun, Sunghae and Park, {Sang Sung} and Jang, {Dong Sik}",
year = "2014",
month = "6",
day = "1",
doi = "10.1016/j.eswa.2013.11.018",
language = "English",
volume = "41",
pages = "3204--3212",
journal = "Expert Systems with Applications",
issn = "0957-4174",
publisher = "Elsevier Limited",
number = "7",
}

TY  - JOUR
T1  - Document clustering method using dimension reduction and support vector clustering to overcome sparseness
AU  - Jun, Sunghae
AU  - Park, Sang Sung
AU  - Jang, Dong Sik
PY  - 2014/6/1
Y1  - 2014/6/1
N2  - Many studies on developing technologies have been published as articles, papers, or patents. We use and analyze these documents to find scientific and technological trends. In this paper, we consider document clustering as a method of document data analysis. In general, we have trouble analyzing documents directly because document data are not suitable for statistical and machine learning methods of analysis. Therefore, we have to transform document data into structured data for analytical purposes. For this process, we use text mining techniques. The structured data are very sparse, and hence, it is difficult to analyze them. This study proposes a new method to overcome the sparsity problem of document clustering. We build a combined clustering method using dimension reduction and K-means clustering based on support vector clustering and Silhouette measure. In particular, we attempt to overcome the sparseness in patent document clustering. To verify the efficacy of our work, we first conduct an experiment using news data from the machine learning repository of the University of California at Irvine. Second, using patent documents retrieved from the United States Patent and Trademark Office, we carry out patent clustering for technology forecasting.
AB  - Many studies on developing technologies have been published as articles, papers, or patents. We use and analyze these documents to find scientific and technological trends. In this paper, we consider document clustering as a method of document data analysis. In general, we have trouble analyzing documents directly because document data are not suitable for statistical and machine learning methods of analysis. Therefore, we have to transform document data into structured data for analytical purposes. For this process, we use text mining techniques. The structured data are very sparse, and hence, it is difficult to analyze them. This study proposes a new method to overcome the sparsity problem of document clustering. We build a combined clustering method using dimension reduction and K-means clustering based on support vector clustering and Silhouette measure. In particular, we attempt to overcome the sparseness in patent document clustering. To verify the efficacy of our work, we first conduct an experiment using news data from the machine learning repository of the University of California at Irvine. Second, using patent documents retrieved from the United States Patent and Trademark Office, we carry out patent clustering for technology forecasting.
KW  - Dimension reduction
KW  - Document clustering
KW  - K-means clustering based on support vector clustering
KW  - Patent clustering
KW  - Silhouette measure
KW  - Sparseness problem
UR  - http://www.scopus.com/inward/record.url?scp=84890497754&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84890497754&partnerID=8YFLogxK
U2  - 10.1016/j.eswa.2013.11.018
DO  - 10.1016/j.eswa.2013.11.018
M3  - Article
AN  - SCOPUS:84890497754
VL  - 41
SP  - 3204
EP  - 3212
JO  - Expert Systems with Applications
JF  - Expert Systems with Applications
SN  - 0957-4174
IS  - 7
ER  -