Automatic stop word generation for mining software artifact using topic model with pointwise mutual information

Jung Been Lee, Taek Lee, Hoh In

Research output: Contribution to journal › Article

Abstract

Mining software artifacts is a useful way to understand the source code of software projects. Topic modeling in particular has been widely used to discover meaningful information from software artifacts. However, software artifacts are unstructured and contain a mix of textual types within the natural text. These software artifact characteristics worsen the performance of topic modeling. Among several natural language preprocessing tasks, removing stop words to reduce meaningless and uninteresting terms is an efficient way to improve the quality of topic models. Although many approaches are used to generate effective stop words, the lists are outdated or too general to apply to mining software artifacts. In addition, the performance of the topic model is sensitive to the datasets used in the training for each approach. To resolve these problems, we propose an automatic stop word generation approach for topic models of software artifacts. By measuring topic coherence among words in the topic using Pointwise Mutual Information (PMI), we added words with a low PMI score to our stop words list for every topic modeling loop. Through our experiment, we proved that our stop words list results in a higher performance of the topic model than lists from other approaches.
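The PMI-based filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how PMI is estimated, what threshold is used, or how the topic model is trained, so everything here is an assumption. Probabilities are estimated from document-level co-occurrence, a word's score is its mean PMI with the other top words of the same topic, and the helper names (`doc_frequencies`, `mean_pmi_per_word`, `low_pmi_words`), the `eps` smoothing term, and the threshold of 0 are hypothetical. The topic's top words are taken as given rather than produced by a real topic model.

```python
import math
from itertools import combinations


def doc_frequencies(docs, vocab):
    """Count documents containing each word and each unordered word pair."""
    vocab_set = set(vocab)
    single = {w: 0 for w in vocab_set}
    pair = {frozenset(p): 0 for p in combinations(sorted(vocab_set), 2)}
    for doc in docs:
        present = sorted(set(doc) & vocab_set)
        for w in present:
            single[w] += 1
        for p in combinations(present, 2):
            pair[frozenset(p)] += 1
    return single, pair


def mean_pmi_per_word(topic_words, docs, eps=1e-12):
    """Mean PMI of each topic word with the other words of the same topic.

    Assumes topic_words holds at least two distinct words.
    eps is a small smoothing constant (an assumption) that avoids log(0).
    """
    n = len(docs)
    single, pair = doc_frequencies(docs, topic_words)
    scores = {}
    for w in topic_words:
        total = 0.0
        for other in topic_words:
            if other == w:
                continue
            p_joint = pair[frozenset((w, other))] / n
            p_w = single[w] / n
            p_other = single[other] / n
            # PMI(w, other) = log( P(w, other) / (P(w) * P(other)) )
            total += math.log((p_joint + eps) / (p_w * p_other + eps))
        scores[w] = total / (len(topic_words) - 1)
    return scores


def low_pmi_words(topic_words, docs, threshold=0.0):
    """Return topic words whose mean PMI falls below the (assumed) threshold."""
    scores = mean_pmi_per_word(topic_words, docs)
    return {w for w, s in scores.items() if s < threshold}


# Toy corpus: each document is a list of tokens (illustrative data only).
docs = [
    ["parser", "token", "syntax", "the"],
    ["parser", "token", "syntax"],
    ["socket", "network", "the"],
    ["socket", "network", "packet", "the"],
]
# Hypothetical top words of one topic, as a topic model might report them.
topic = ["parser", "token", "syntax", "the"]

stop_word_candidates = low_pmi_words(topic, docs)
print(stop_word_candidates)
```

In an iterative setting, the candidates returned for each topic would be added to the stop word list and the topic model retrained on the filtered corpus, repeating for every topic modeling loop as the abstract describes.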

Original language: English
Pages (from-to): 1761-1772
Number of pages: 12
Journal: IEICE Transactions on Information and Systems
Volume: E102D
Issue number: 9
DOI: 10.1587/transinf.2018EDP7390
Publication status: Published - 2019 Jan 1

Keywords

  • Pointwise Mutual Information (PMI)
  • Software artifact
  • Stop words
  • Text mining
  • Topic modeling

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering
  • Artificial Intelligence

Cite this

Automatic stop word generation for mining software artifact using topic model with pointwise mutual information. / Lee, Jung Been; Lee, Taek; In, Hoh.

In: IEICE Transactions on Information and Systems, Vol. E102D, No. 9, 01.01.2019, p. 1761-1772.

@article{f3041bc5afe64f9fad567252e235b86b,
title = "Automatic stop word generation for mining software artifact using topic model with pointwise mutual information",
abstract = "Mining software artifacts is a useful way to understand the source code of software projects. Topic modeling in particular has been widely used to discover meaningful information from software artifacts. However, software artifacts are unstructured and contain a mix of textual types within the natural text. These software artifact characteristics worsen the performance of topic modeling. Among several natural language preprocessing tasks, removing stop words to reduce meaningless and uninteresting terms is an efficient way to improve the quality of topic models. Although many approaches are used to generate effective stop words, the lists are outdated or too general to apply to mining software artifacts. In addition, the performance of the topic model is sensitive to the datasets used in the training for each approach. To resolve these problems, we propose an automatic stop word generation approach for topic models of software artifacts. By measuring topic coherence among words in the topic using Pointwise Mutual Information (PMI), we added words with a low PMI score to our stop words list for every topic modeling loop. Through our experiment, we proved that our stop words list results in a higher performance of the topic model than lists from other approaches.",
keywords = "Pointwise Mutual Information (PMI), Software artifact, Stop words, Text mining, Topic modeling",
author = "Lee, Jung Been and Lee, Taek and In, Hoh",
year = "2019",
month = "1",
day = "1",
doi = "10.1587/transinf.2018EDP7390",
language = "English",
volume = "E102D",
pages = "1761--1772",
journal = "IEICE Transactions on Information and Systems",
issn = "0916-8532",
publisher = "Maruzen Co., Ltd/Maruzen Kabushikikaisha",
number = "9",
}

TY - JOUR
T1 - Automatic stop word generation for mining software artifact using topic model with pointwise mutual information
AU - Lee, Jung Been
AU - Lee, Taek
AU - In, Hoh
PY - 2019/1/1
Y1 - 2019/1/1
AB - Mining software artifacts is a useful way to understand the source code of software projects. Topic modeling in particular has been widely used to discover meaningful information from software artifacts. However, software artifacts are unstructured and contain a mix of textual types within the natural text. These software artifact characteristics worsen the performance of topic modeling. Among several natural language preprocessing tasks, removing stop words to reduce meaningless and uninteresting terms is an efficient way to improve the quality of topic models. Although many approaches are used to generate effective stop words, the lists are outdated or too general to apply to mining software artifacts. In addition, the performance of the topic model is sensitive to the datasets used in the training for each approach. To resolve these problems, we propose an automatic stop word generation approach for topic models of software artifacts. By measuring topic coherence among words in the topic using Pointwise Mutual Information (PMI), we added words with a low PMI score to our stop words list for every topic modeling loop. Through our experiment, we proved that our stop words list results in a higher performance of the topic model than lists from other approaches.
KW - Pointwise Mutual Information (PMI)
KW - Software artifact
KW - Stop words
KW - Text mining
KW - Topic modeling
UR - http://www.scopus.com/inward/record.url?scp=85071942456&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85071942456&partnerID=8YFLogxK
U2 - 10.1587/transinf.2018EDP7390
DO - 10.1587/transinf.2018EDP7390
M3 - Article
VL - E102D
SP - 1761
EP - 1772
JO - IEICE Transactions on Information and Systems
JF - IEICE Transactions on Information and Systems
SN - 0916-8532
IS - 9
ER -