Supervised paragraph vector: Distributed representations of words, documents and class labels

Eunjeong L. Park, Sungzoon Cho, Pilsung Kang

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

The traditional bag-of-words representation of documents suffers from high dimensionality and sparsity. Recently, many methods have been proposed to obtain lower-dimensional, densely distributed representations. Paragraph Vector is one such algorithm: it extends word2vec by treating each paragraph as an additional word. However, it produces a single representation for all tasks, while different tasks may require different representations. In this paper, we propose Supervised Paragraph Vector, a task-specific variant of Paragraph Vector for settings where class labels are available. Supervised Paragraph Vector trains on class labels together with words and documents, so the resulting representations are tailored to the particular classification task. To demonstrate the benefits of the proposed algorithm, we use three performance criteria: interpretability, discriminative power, and computational efficiency. To assess interpretability, we identify the words closest to and farthest from each class vector and show that the closest words are strongly related to the corresponding class. We also use principal component analysis to visualize all words, documents, and class labels in a single space, showing that our method effectively reveals the words and documents related to each class label. To evaluate discriminative power and computational efficiency, we perform document classification on four commonly used datasets with various classifiers and achieve classification accuracies comparable to bag-of-words and Paragraph Vector.
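The core idea can be illustrated with a minimal sketch: below, gensim's Doc2Vec approximates the Supervised Paragraph Vector setup by attaching each document's class label as an extra tag, so words, documents, and class labels all receive vectors in one space. This is not the authors' implementation; the toy corpus, the tag names (DOC_i, CLASS_label), and the hyperparameters are illustrative assumptions.

# Approximate sketch of the Supervised Paragraph Vector idea using gensim 4.x.
# Not the paper's exact training procedure; corpus and settings are placeholders.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# toy corpus: (tokenized document, class label)
corpus = [
    (["the", "pitcher", "threw", "a", "fastball"], "sports"),
    (["the", "senate", "passed", "the", "bill"], "politics"),
    (["the", "team", "won", "the", "championship"], "sports"),
    (["voters", "elected", "a", "new", "mayor"], "politics"),
]

# each document gets its own tag plus its class label as a shared tag, so a
# distributed representation is learned for every word, document, and class
tagged = [
    TaggedDocument(words=tokens, tags=[f"DOC_{i}", f"CLASS_{label}"])
    for i, (tokens, label) in enumerate(corpus)
]

model = Doc2Vec(vector_size=50, min_count=1, epochs=100)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# interpretability check described in the abstract: words closest to a class vector
class_vec = model.dv["CLASS_sports"]
print(model.wv.similar_by_vector(class_vec, topn=5))

# document vectors that can be fed to any downstream classifier
doc_vecs = [model.dv[f"DOC_{i}"] for i in range(len(corpus))]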

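Continuing that sketch (and reusing model, corpus, and doc_vecs from it), the snippet below mirrors the other two evaluation steps from the abstract: a PCA projection that places words, documents, and class labels in the same 2-D view, and an off-the-shelf classifier trained on the document vectors. scikit-learn is assumed to be available, and the choice of logistic regression is ours, not taken from the paper.

# Continuation of the sketch above; reuses model, corpus, and doc_vecs.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# project word, document, and class vectors into one 2-D space for inspection
names = (
    list(model.wv.index_to_key)
    + [f"DOC_{i}" for i in range(len(corpus))]
    + ["CLASS_sports", "CLASS_politics"]
)
vectors = np.vstack(
    [model.wv[w] for w in model.wv.index_to_key]
    + doc_vecs
    + [model.dv["CLASS_sports"], model.dv["CLASS_politics"]]
)
coords = PCA(n_components=2).fit_transform(vectors)
for name, (x, y) in zip(names, coords):
    print(f"{name}\t{x:.3f}\t{y:.3f}")

# document vectors as features for a simple classifier (logistic regression here)
labels = [label for _, label in corpus]
clf = LogisticRegression(max_iter=1000).fit(doc_vecs, labels)
print(clf.score(doc_vecs, labels))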
Original language: English
Article number: 8653834
Pages (from-to): 29051-29064
Number of pages: 14
Journal: IEEE Access
Volume: 7
DOIs: 10.1109/ACCESS.2019.2901933
Publication status: Published - 2019 Jan 1

Keywords

  • Class label
  • distributed representations
  • document embedding
  • representation learning
  • word embedding

ASJC Scopus subject areas

  • Computer Science(all)
  • Materials Science(all)
  • Engineering(all)

Cite this

Supervised paragraph vector: Distributed representations of words, documents and class labels. / Park, Eunjeong L.; Cho, Sungzoon; Kang, Pilsung.

In: IEEE Access, Vol. 7, 8653834, 01.01.2019, p. 29051-29064.
