Incorporating word embeddings into open directory project based large-scale classification

Kang Min Kim, Aliyeva Dinara, Byung Ju Choi, Sang-Geun Lee

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

Recently, implicit representation models, such as embeddings or deep learning, have been successfully applied to text classification tasks owing to their outstanding performance. However, these approaches are limited to small- or moderate-scale text classification. Explicit representation models are often used in large-scale text classification, such as Open Directory Project (ODP)-based text classification. However, the performance of these models is limited by the associated knowledge bases. In this paper, we incorporate word embeddings into ODP-based large-scale classification. To this end, we first generate category vectors, which represent the semantics of ODP categories, by jointly modeling word embeddings and ODP-based text classification. We then propose a novel semantic similarity measure that utilizes the category and word vectors obtained from the joint model. The evaluation results clearly show the efficacy of our methodology in large-scale text classification. The proposed scheme exhibits significant improvements of 10% and 28% in terms of macro-averaged F1-score and precision at k, respectively, over state-of-the-art techniques.
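
The two metrics named in the abstract have standard textbook definitions, which the following minimal sketch illustrates. It is not taken from the paper: the function names `macro_f1` and `precision_at_k`, the choice to average over every category seen in either the gold or predicted labels, and the toy category names are all assumptions made for illustration.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-category F1, then the unweighted mean.

    Assumption: categories are the union of gold and predicted labels.
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted category gets a false positive
            fn[gold] += 1  # gold category gets a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked categories that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for c in ranked[:k] if c in relevant) / k


if __name__ == "__main__":
    # Hypothetical category labels, used only for illustration.
    gold = ["Arts", "Arts", "Science", "Sports"]
    pred = ["Arts", "Science", "Science", "Sports"]
    print(macro_f1(gold, pred))
    print(precision_at_k(["Arts", "Games", "Science"], {"Arts", "Science"}, 2))
```

Because macro-averaging weights every category equally, rare categories count as much as frequent ones, which matters when a taxonomy has many sparsely populated categories.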

Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining - 22nd Pacific-Asia Conference, PAKDD 2018, Proceedings
Editors: Bao Ho, Dinh Phung, Geoffrey I. Webb, Vincent S. Tseng, Mohadeseh Ganji, Lida Rashidi
Publisher: Springer Verlag
Pages: 376-388
Number of pages: 13
ISBN (Print): 9783319930367
DOIs: https://doi.org/10.1007/978-3-319-93037-4_30
Publication status: Published - 2018 Jan 1
Event: 22nd Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2018 - Melbourne, Australia
Duration: 2018 Jun 3 to 2018 Jun 6

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10938 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 22nd Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2018
Country: Australia
City: Melbourne
Period: 18/6/3 to 18/6/6

Fingerprint

Text Classification
Semantics
Joint Model
Semantic Similarity
Similarity Measure
Knowledge Base
Averaging
Efficacy
Macros
Model
Methodology
Evaluation
Modeling

Keywords

  • Text classification
  • Word embeddings

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Kim, K. M., Dinara, A., Choi, B. J., & Lee, S-G. (2018). Incorporating word embeddings into open directory project based large-scale classification. In B. Ho, D. Phung, G. I. Webb, V. S. Tseng, M. Ganji, & L. Rashidi (Eds.), Advances in Knowledge Discovery and Data Mining - 22nd Pacific-Asia Conference, PAKDD 2018, Proceedings (pp. 376-388). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10938 LNAI). Springer Verlag. https://doi.org/10.1007/978-3-319-93037-4_30

@inproceedings{2e2262369e4b4969bf153b80fa7c52fd,
title = "Incorporating word embeddings into open directory project based large-scale classification",
abstract = "Recently, implicit representation models, such as embedding or deep learning, have been successfully adopted to text classification task due to their outstanding performance. However, these approaches are limited to small- or moderate-scale text classification. Explicit representation models are often used in a large-scale text classification, like the Open Directory Project (ODP)-based text classification. However, the performance of these models is limited to the associated knowledge bases. In this paper, we incorporate word embeddings into the ODP-based large-scale classification. To this end, we first generate category vectors, which represent the semantics of ODP categories by jointly modeling word embeddings and the ODP-based text classification. We then propose a novel semantic similarity measure, which utilizes the category and word vectors obtained from the joint model. The evaluation results clearly show the efficacy of our methodology in large-scale text classification. The proposed scheme exhibits significant improvements of 10{\%} and 28{\%} in terms of macro-averaging F1-score and precision at k, respectively, over state-of-the-art techniques.",
keywords = "Text classification, Word embeddings",
author = "Kim, {Kang Min} and Aliyeva Dinara and Choi, {Byung Ju} and Sang-Geun Lee",
year = "2018",
month = "1",
day = "1",
doi = "10.1007/978-3-319-93037-4_30",
language = "English",
isbn = "9783319930367",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "376--388",
editor = "Bao Ho and Dinh Phung and Webb, {Geoffrey I.} and Tseng, {Vincent S.} and Mohadeseh Ganji and Lida Rashidi",
booktitle = "Advances in Knowledge Discovery and Data Mining - 22nd Pacific-Asia Conference, PAKDD 2018, Proceedings",

}

TY - GEN

T1 - Incorporating word embeddings into open directory project based large-scale classification

AU - Kim, Kang Min

AU - Dinara, Aliyeva

AU - Choi, Byung Ju

AU - Lee, Sang-Geun

PY - 2018/1/1

Y1 - 2018/1/1

AB - Recently, implicit representation models, such as embedding or deep learning, have been successfully adopted to text classification task due to their outstanding performance. However, these approaches are limited to small- or moderate-scale text classification. Explicit representation models are often used in a large-scale text classification, like the Open Directory Project (ODP)-based text classification. However, the performance of these models is limited to the associated knowledge bases. In this paper, we incorporate word embeddings into the ODP-based large-scale classification. To this end, we first generate category vectors, which represent the semantics of ODP categories by jointly modeling word embeddings and the ODP-based text classification. We then propose a novel semantic similarity measure, which utilizes the category and word vectors obtained from the joint model. The evaluation results clearly show the efficacy of our methodology in large-scale text classification. The proposed scheme exhibits significant improvements of 10% and 28% in terms of macro-averaging F1-score and precision at k, respectively, over state-of-the-art techniques.

KW - Text classification

KW - Word embeddings

UR - http://www.scopus.com/inward/record.url?scp=85049372495&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85049372495&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-93037-4_30

DO - 10.1007/978-3-319-93037-4_30

M3 - Conference contribution

AN - SCOPUS:85049372495

SN - 9783319930367

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 376

EP - 388

BT - Advances in Knowledge Discovery and Data Mining - 22nd Pacific-Asia Conference, PAKDD 2018, Proceedings

A2 - Ho, Bao

A2 - Phung, Dinh

A2 - Webb, Geoffrey I.

A2 - Tseng, Vincent S.

A2 - Ganji, Mohadeseh

A2 - Rashidi, Lida

PB - Springer Verlag

ER -