Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec

Donghwa Kim, Deokseong Seo, Suhyoun Cho, Pilsung Kang

Research output: Contribution to journal › Article

19 Citations (Scopus)

Abstract

The purpose of document classification is to assign the most appropriate label to a specified document. The main challenges in document classification are insufficient label information and an unstructured, sparse format. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas considering multiple document representation schemes can resolve the latter. Co-training is a popular SSL method that attempts to exploit various perspectives, in terms of feature subsets, for the same example. In this paper, we propose multi-co-training (MCT) for improving the performance of document classification. To increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network-based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions.
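The co-training idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' MCT algorithm: it is a generic multi-view pseudo-labeling loop, assuming a simple nearest-centroid classifier per view and hand-made toy vectors standing in for the TF–IDF, LDA, and Doc2Vec representations. All function names, the confidence measure, and the threshold are hypothetical choices made for illustration.

```python
from math import dist

def centroid_fit(X, y):
    """Fit a nearest-centroid classifier: return {label: mean vector}."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def centroid_predict(model, x):
    """Return (label, confidence); confidence is the relative margin
    between the nearest and second-nearest centroid (0 = tie)."""
    ranked = sorted((dist(x, c), label) for label, c in model.items())
    best_d, best_label = ranked[0]
    second_d = ranked[1][0] if len(ranked) > 1 else best_d
    return best_label, (second_d - best_d) / (second_d + best_d + 1e-12)

def multi_view_co_training(views, labels, rounds=5, threshold=0.5):
    """views: one feature matrix per representation, same document order.
    labels: a class label per document, or None if unlabeled."""
    labels = list(labels)  # do not mutate the caller's list
    for _ in range(rounds):
        labeled = [i for i, y in enumerate(labels) if y is not None]
        unlabeled = [i for i, y in enumerate(labels) if y is None]
        if not unlabeled:
            break
        # Train one classifier per view on the current labeled pool.
        models = [centroid_fit([V[i] for i in labeled],
                               [labels[i] for i in labeled]) for V in views]
        # Each unlabeled document takes the prediction of its most
        # confident view, but only if that confidence is high enough.
        new_labels = {}
        for i in unlabeled:
            label, conf = max((centroid_predict(m, V[i])
                               for m, V in zip(models, views)),
                              key=lambda p: p[1])
            if conf >= threshold:
                new_labels[i] = label
        if not new_labels:
            break
        for i, label in new_labels.items():
            labels[i] = label
    return labels

# Toy stand-ins for three representations of six documents (assumed data).
view_tfidf = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [0.5, 0.5]]
view_lda = [[0.8, 0.2], [0.2, 0.8], [0.7, 0.3], [0.3, 0.7], [0.9, 0.1], [0.5, 0.5]]
view_d2v = [[1.0, -1.0], [-1.0, 1.0], [0.8, -0.9], [-0.7, 0.8], [0.1, 0.0], [-0.9, 1.0]]
initial = ["sports", "tech", None, None, None, None]

result = multi_view_co_training([view_tfidf, view_lda, view_d2v], initial)
print(result)  # ['sports', 'tech', 'sports', 'tech', 'sports', 'tech']
```

In this toy data, documents 4 and 5 are ambiguous in the TF–IDF stand-in but clearly separable in one of the other views, which is the situation where combining several representations helps.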

Original language: English
Pages (from-to): 15-29
Number of pages: 15
Journal: Information Sciences
Volume: 477
DOIs: 10.1016/j.ins.2018.10.006
Publication status: Published - 2019 Mar 1

Fingerprint

Co-training
Document Classification
Dirichlet
Semi-supervised Learning
Supervised learning
Term
Labels
Assign
Resolve
Neural Networks
Transform
Benchmark
Subset
Experimental Results
Demonstrate

Keywords

  • Co-training
  • Doc2vec
  • Document classification
  • LDA
  • Semi-supervised learning
  • TF–IDF

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Theoretical Computer Science
  • Computer Science Applications
  • Information Systems and Management
  • Artificial Intelligence

Cite this

Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec. / Kim, Donghwa; Seo, Deokseong; Cho, Suhyoun; Kang, Pilsung.

In: Information Sciences, Vol. 477, 01.03.2019, p. 15-29.

@article{6cf641329788416890b268858b0f9680,
title = "Multi-co-training for document classification using various document representations: {TF–IDF}, {LDA}, and {Doc2Vec}",
abstract = "The purpose of document classification is to assign the most appropriate label to a specified document. The main challenges in document classification are insufficient label information and unstructured sparse format. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas the consideration of multiple document representation schemes can resolve the latter problem. Co-training is a popular SSL method that attempts to exploit various perspectives in terms of feature subsets for the same example. In this paper, we propose multi-co-training (MCT) for improving the performance of document classification. In order to increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network-based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions.",
keywords = "Co-training, Doc2vec, Document classification, LDA, Semi-supervised learning, TF–IDF",
author = "Donghwa Kim and Deokseong Seo and Suhyoun Cho and Pilsung Kang",
year = "2019",
month = "3",
day = "1",
doi = "10.1016/j.ins.2018.10.006",
language = "English",
volume = "477",
pages = "15--29",
journal = "Information Sciences",
issn = "0020-0255",
publisher = "Elsevier Inc.",
}

TY - JOUR

T1 - Multi-co-training for document classification using various document representations

T2 - TF–IDF, LDA, and Doc2Vec

AU - Kim, Donghwa

AU - Seo, Deokseong

AU - Cho, Suhyoun

AU - Kang, Pilsung

PY - 2019/3/1

Y1 - 2019/3/1

N2 - The purpose of document classification is to assign the most appropriate label to a specified document. The main challenges in document classification are insufficient label information and unstructured sparse format. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas the consideration of multiple document representation schemes can resolve the latter problem. Co-training is a popular SSL method that attempts to exploit various perspectives in terms of feature subsets for the same example. In this paper, we propose multi-co-training (MCT) for improving the performance of document classification. In order to increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network-based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions.

KW - Co-training

KW - Doc2vec

KW - Document classification

KW - LDA

KW - Semi-supervised learning

KW - TF–IDF

UR - http://www.scopus.com/inward/record.url?scp=85055422604&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85055422604&partnerID=8YFLogxK

U2 - 10.1016/j.ins.2018.10.006

DO - 10.1016/j.ins.2018.10.006

M3 - Article

AN - SCOPUS:85055422604

VL - 477

SP - 15

EP - 29

JO - Information Sciences

JF - Information Sciences

SN - 0020-0255

ER -