TY - JOUR
T1 - Document clustering method using dimension reduction and support vector clustering to overcome sparseness
AU - Jun, Sunghae
AU - Park, Sang Sung
AU - Jang, Dong Sik
N1 - Funding Information:
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (No. 2012-0026953). This work was supported by the Brain Korea 21 PLUS Project in 2013.
Copyright:
Copyright 2014 Elsevier B.V., All rights reserved.
PY - 2014/6/1
Y1 - 2014/6/1
N2 - Many studies on developing technologies have been published as articles, papers, or patents. We use and analyze these documents to find scientific and technological trends. In this paper, we consider document clustering as a method of document data analysis. In general, we have trouble analyzing documents directly because document data are not suitable for statistical and machine learning methods of analysis. Therefore, we have to transform document data into structured data for analytical purposes. For this process, we use text mining techniques. The structured data are very sparse, and hence, it is difficult to analyze them. This study proposes a new method to overcome the sparsity problem of document clustering. We build a combined clustering method using dimension reduction and K-means clustering based on support vector clustering and Silhouette measure. In particular, we attempt to overcome the sparseness in patent document clustering. To verify the efficacy of our work, we first conduct an experiment using news data from the machine learning repository of the University of California at Irvine. Second, using patent documents retrieved from the United States Patent and Trademark Office, we carry out patent clustering for technology forecasting.
KW - Dimension reduction
KW - Document clustering
KW - K-means clustering based on support vector clustering
KW - Patent clustering
KW - Silhouette measure
KW - Sparseness problem
UR - http://www.scopus.com/inward/record.url?scp=84890497754&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84890497754&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2013.11.018
DO - 10.1016/j.eswa.2013.11.018
M3 - Article
AN - SCOPUS:84890497754
VL - 41
SP - 3204
EP - 3212
JO - Expert Systems with Applications
JF - Expert Systems with Applications
SN - 0957-4174
IS - 7
ER -