Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec

Donghwa Kim, Deokseong Seo, Suhyoun Cho, Pilsung Kang

Research output: Contribution to journal › Article › peer-review

222 Citations (Scopus)


The purpose of document classification is to assign the most appropriate label to a specified document. The main challenges in document classification are insufficient label information and unstructured sparse format. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas the consideration of multiple document representation schemes can resolve the latter problem. Co-training is a popular SSL method that attempts to exploit various perspectives in terms of feature subsets for the same example. In this paper, we propose multi-co-training (MCT) for improving the performance of document classification. In order to increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network-based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions.
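The two ideas in the abstract can be illustrated with a minimal sketch (not the authors' code): build multiple feature "views" of the same documents using scikit-learn, then run one co-training round in which each view's classifier pseudo-labels its most confident unlabeled document for the shared pool. The toy documents, labels, and single-pick confidence rule are illustrative assumptions; the Doc2Vec view would come from gensim's `Doc2Vec` and is omitted here.

```python
# Sketch of multi-view co-training (illustrative, not the paper's MCT code).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

docs = [
    "the cat sat on the mat",            # labeled: pets (0)
    "dogs and cats make good pets",      # labeled: pets (0)
    "stock markets fell sharply today",  # labeled: finance (1)
    "investors sold shares amid losses", # labeled: finance (1)
    "my dog chased the cat",             # unlabeled
    "a kitten played with yarn",         # unlabeled
    "the market rallied as shares rose", # unlabeled
    "traders bought stocks on the dip",  # unlabeled
]

# View 1: TF-IDF over a bag-of-words vocabulary.
X1 = TfidfVectorizer().fit_transform(docs).toarray()

# View 2: per-document topic distribution from LDA (fit on raw counts).
counts = CountVectorizer().fit_transform(docs)
X2 = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)
# (View 3, Doc2Vec, would come from gensim's Doc2Vec; omitted here.)

labeled, y = [0, 1, 2, 3], [0, 0, 1, 1]  # 0 = pets, 1 = finance
unlabeled = [4, 5, 6, 7]

# One co-training round: each view's classifier moves its single most
# confident unlabeled document, with its pseudo-label, into the shared pool.
for X in (X1, X2):
    clf = LogisticRegression().fit(X[labeled], y)
    proba = clf.predict_proba(X[unlabeled])
    pick = int(np.argmax(proba.max(axis=1)))   # most confident example
    labeled.append(unlabeled.pop(pick))
    y.append(int(proba[pick].argmax()))        # its pseudo-label

print(len(labeled), len(unlabeled))  # 6 labeled, 2 still unlabeled
```

In the full method the round would repeat until the unlabeled pool is exhausted, and a confidence threshold (rather than a single pick per view) would typically gate which pseudo-labels are admitted.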

Original language: English
Pages (from-to): 15-29
Number of pages: 15
Journal: Information Sciences
Publication status: Published - 2019 Mar


Keywords

  • Co-training
  • Doc2vec
  • Document classification
  • LDA
  • Semi-supervised learning
  • TF–IDF

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Software
  • Control and Systems Engineering
  • Computer Science Applications
  • Information Systems and Management
  • Artificial Intelligence


