Multi-pretraining for large-scale text classification

Kang Min Kim, Bumsu Hyeon, Yeachan Kim, Jun Hyung Park, Sang Keun Lee

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Deep neural network-based pretraining methods have achieved impressive results in many natural language processing tasks including text classification. However, their applicability to large-scale text classification with numerous categories (e.g., several thousands) is yet to be well-studied, where the training data is insufficient and skewed in terms of categories. In addition, existing pretraining methods usually involve excessive computation and memory overheads. In this paper, we develop a novel multi-pretraining framework for large-scale text classification. This multi-pretraining framework includes both a self-supervised pretraining and a weakly supervised pretraining. We newly introduce an out-of-context words detection task on the unlabeled data as the self-supervised pretraining. It captures the topic-consistency of words used in sentences, which is proven to be useful for text classification. In addition, we propose a weakly supervised pretraining, where labels for text classification are obtained automatically from an existing approach. Experimental results clearly show that both pretraining approaches are effective for large-scale text classification task. The proposed scheme exhibits significant improvements as much as 3.8% in terms of macro-averaging F1-score over strong pretraining methods, while being computationally efficient.

Original languageEnglish
Title of host publicationFindings of the Association for Computational Linguistics Findings of ACL
Subtitle of host publicationEMNLP 2020
PublisherAssociation for Computational Linguistics (ACL)
Pages2041-2050
Number of pages10
ISBN (Electronic)9781952148903
Publication statusPublished - 2020
EventFindings of the Association for Computational Linguistics, ACL 2020: EMNLP 2020 - Virtual, Online
Duration: 2020 Nov 162020 Nov 20

Publication series

NameFindings of the Association for Computational Linguistics Findings of ACL: EMNLP 2020

Conference

ConferenceFindings of the Association for Computational Linguistics, ACL 2020: EMNLP 2020
CityVirtual, Online
Period20/11/1620/11/20

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

Fingerprint

Dive into the research topics of 'Multi-pretraining for large-scale text classification'. Together they form a unique fingerprint.

Cite this