Indexing by Latent Dirichlet Allocation and an Ensemble Model

Yanshan Wang, Jae Sung Lee, In Chan Choi

Research output: Contribution to journal (Article)

15 Citations (Scopus)

Abstract

The contribution of this article is twofold. First, we present Indexing by latent Dirichlet allocation (LDI), an automatic document indexing method. Many ad hoc applications, or their variants with smoothing techniques suggested in LDA-based language modeling, can result in unsatisfactory performance as the document representations do not accurately reflect concept space. To improve document retrieval performance, we introduce a new definition of document probability vectors in the context of LDA and present a novel scheme for automatic document indexing based on LDA. Second, we propose an Ensemble Model (EnM) for document retrieval. EnM combines basic indexing models by assigning different weights and attempts to uncover the optimal weights to maximize the mean average precision. To solve the optimization problem, we propose an algorithm, which is derived based on the boosting method. The results of our computational experiments on benchmark data sets indicate that both the proposed approaches are viable options for document retrieval.
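
As an illustration of the ensemble idea described in the abstract (combining the scores of several indexing models with weights and judging the result by mean average precision), the following is a minimal Python sketch. It is not the authors' EnM algorithm: the boosting-derived weight optimization is not reproduced, and the function names, the linear combination of scores, and the data layout are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list of document ids."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def ensemble_ranking(model_scores: List[Dict[str, float]],
                     weights: List[float]) -> List[str]:
    """Rank documents by a weighted sum of per-model scores (assumed linear form)."""
    combined: Dict[str, float] = {}
    for scores, weight in zip(model_scores, weights):
        for doc_id, score in scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    # Highest combined score first; ties broken by document id for determinism.
    return sorted(combined, key=lambda d: (-combined[d], d))


def mean_average_precision(
    queries: List[Tuple[List[Dict[str, float]], Set[str]]],
    weights: List[float],
) -> float:
    """Mean of the per-query average precision for one weight vector."""
    if not queries:
        return 0.0
    total = sum(
        average_precision(ensemble_ranking(model_scores, weights), relevant)
        for model_scores, relevant in queries
    )
    return total / len(queries)
```

In this sketch, each query is a pair of per-model score dictionaries and a set of relevant document ids. Candidate weight vectors can then be compared by calling `mean_average_precision` on the same queries. How the optimal weights are actually found is the subject of the article and is not shown here.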

Original language: English
Pages (from-to): 1736-1750
Number of pages: 15
Journal: Journal of the Association for Information Science and Technology
Volume: 67
Issue number: 7
DOI: 10.1002/asi.23444
Publication status: Published - 2016 Jul 1

Keywords

  • information retrieval
  • machine learning
  • searching

ASJC Scopus subject areas

  • Information Systems and Management
  • Library and Information Sciences
  • Computer Networks and Communications
  • Information Systems

Cite this

Indexing by Latent Dirichlet Allocation and an Ensemble Model. / Wang, Yanshan; Lee, Jae Sung; Choi, In Chan.

In: Journal of the Association for Information Science and Technology, Vol. 67, No. 7, 01.07.2016, p. 1736-1750.

@article{95838f4d215d47cdad987e181f07bc44,
title = "Indexing by Latent Dirichlet Allocation and an Ensemble Model",
abstract = "The contribution of this article is twofold. First, we present Indexing by latent Dirichlet allocation (LDI), an automatic document indexing method. Many ad hoc applications, or their variants with smoothing techniques suggested in LDA-based language modeling, can result in unsatisfactory performance as the document representations do not accurately reflect concept space. To improve document retrieval performance, we introduce a new definition of document probability vectors in the context of LDA and present a novel scheme for automatic document indexing based on LDA. Second, we propose an Ensemble Model (EnM) for document retrieval. EnM combines basic indexing models by assigning different weights and attempts to uncover the optimal weights to maximize the mean average precision. To solve the optimization problem, we propose an algorithm, which is derived based on the boosting method. The results of our computational experiments on benchmark data sets indicate that both the proposed approaches are viable options for document retrieval.",
keywords = "information retrieval, machine learning, searching",
author = "Wang, Yanshan and Lee, {Jae Sung} and Choi, {In Chan}",
year = "2016",
month = "7",
day = "1",
doi = "10.1002/asi.23444",
language = "English",
volume = "67",
pages = "1736--1750",
journal = "Journal of the Association for Information Science and Technology",
issn = "2330-1635",
publisher = "John Wiley and Sons Ltd",
number = "7",
}

TY - JOUR

T1 - Indexing by Latent Dirichlet Allocation and an Ensemble Model

AU - Wang, Yanshan

AU - Lee, Jae Sung

AU - Choi, In Chan

PY - 2016/7/1

Y1 - 2016/7/1

N2 - The contribution of this article is twofold. First, we present Indexing by latent Dirichlet allocation (LDI), an automatic document indexing method. Many ad hoc applications, or their variants with smoothing techniques suggested in LDA-based language modeling, can result in unsatisfactory performance as the document representations do not accurately reflect concept space. To improve document retrieval performance, we introduce a new definition of document probability vectors in the context of LDA and present a novel scheme for automatic document indexing based on LDA. Second, we propose an Ensemble Model (EnM) for document retrieval. EnM combines basic indexing models by assigning different weights and attempts to uncover the optimal weights to maximize the mean average precision. To solve the optimization problem, we propose an algorithm, which is derived based on the boosting method. The results of our computational experiments on benchmark data sets indicate that both the proposed approaches are viable options for document retrieval.

AB - The contribution of this article is twofold. First, we present Indexing by latent Dirichlet allocation (LDI), an automatic document indexing method. Many ad hoc applications, or their variants with smoothing techniques suggested in LDA-based language modeling, can result in unsatisfactory performance as the document representations do not accurately reflect concept space. To improve document retrieval performance, we introduce a new definition of document probability vectors in the context of LDA and present a novel scheme for automatic document indexing based on LDA. Second, we propose an Ensemble Model (EnM) for document retrieval. EnM combines basic indexing models by assigning different weights and attempts to uncover the optimal weights to maximize the mean average precision. To solve the optimization problem, we propose an algorithm, which is derived based on the boosting method. The results of our computational experiments on benchmark data sets indicate that both the proposed approaches are viable options for document retrieval.

KW - information retrieval

KW - machine learning

KW - searching

UR - http://www.scopus.com/inward/record.url?scp=84973890989&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84973890989&partnerID=8YFLogxK

U2 - 10.1002/asi.23444

DO - 10.1002/asi.23444

M3 - Article

AN - SCOPUS:84973890989

VL - 67

SP - 1736

EP - 1750

JO - Journal of the Association for Information Science and Technology

JF - Journal of the Association for Information Science and Technology

SN - 2330-1635

IS - 7

ER -