Automated classification of industry and occupation codes using document classification method

Research output: Contribution to journal › Article

Abstract

This paper describes the development of an automated industry and occupation coding system for Korean Census records. The purpose of the system is to convert natural language responses on survey questionnaires into the corresponding numeric codes according to the standard code book from the Census Bureau. We employ a kNN (k-Nearest Neighbors)-based document classification method and information retrieval techniques to index and weight index terms. To address inconsistencies in how respondents describe the same industry or occupation, we use nouns and phrases acquired from past census data. Using these data, we can estimate the nouns or phrases frequently used to describe a given code. The experimental results show that past census data play an important role in increasing code classification accuracy.
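As a rough illustration of the kNN-based approach the abstract mentions, the sketch below classifies a free-text response by weighting terms with TF-IDF and voting among the k most similar labeled responses under cosine similarity. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the similarity-weighted vote, the function names, and the tiny English training set with made-up codes are all assumptions chosen for readability. The actual system works on Korean text and draws its terms from past census data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled responses standing in for past census data.
# The codes are invented for illustration, not real classification codes.
TRAINING = [
    ("rice farming", "011"),
    ("vegetable farming", "011"),
    ("elementary school teacher", "851"),
    ("high school teacher", "851"),
    ("software developer", "620"),
    ("computer software programming", "620"),
]


def tokenize(text):
    # Assumption: whitespace tokens stand in for extracted nouns and phrases.
    return text.lower().split()


def build_idf(docs):
    # Smoothed inverse document frequency over the training responses.
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(tokens, idf):
    # Terms never seen in training are dropped from the vector.
    tf = Counter(tokens)
    return {t: f * idf[t] for t, f in tf.items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(response, training, k=3):
    docs = [tokenize(text) for text, _ in training]
    idf = build_idf(docs)
    vectors = [(tfidf(d, idf), code) for d, (_, code) in zip(docs, training)]
    query = tfidf(tokenize(response), idf)
    # Rank training responses by similarity to the query, most similar first.
    scored = sorted(
        ((cosine(query, vec), code) for vec, code in vectors),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Similarity-weighted vote among the k nearest neighbors.
    votes = defaultdict(float)
    for sim, code in scored[:k]:
        votes[code] += sim
    best = max(votes, key=votes.get)
    # Return None when no training response shares any term with the query.
    return best if votes[best] > 0.0 else None


if __name__ == "__main__":
    print(knn_classify("teacher at a school", TRAINING))
    print(knn_classify("rice farming worker", TRAINING))
```

Weighting each vote by similarity, rather than counting neighbors equally, keeps a neighbor that shares no terms with the response from influencing the result when fewer than k training responses are relevant.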

Original language: English
Pages (from-to): 827-833
Number of pages: 7
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3316
Publication status: Published - 2004 Dec 1

Fingerprint

Document Classification
Census
Censuses
Occupations
Industry
Information retrieval
Information Storage and Retrieval
Numerics
Inconsistency
Questionnaire
Information Retrieval
Natural Language
Convert
Nearest Neighbor
Language
Coding
Weights and Measures
Experimental Results
Term
Estimate

ASJC Scopus subject areas

  • Computer Science (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Theoretical Computer Science

Cite this

@article{699b1ef8078948b9a3d921c7983ad772,
title = "Automated classification of industry and occupation codes using document classification method",
abstract = "This paper describes the development of an automated industry and occupation coding system for Korean Census records. The purpose of the system is to convert natural language responses on survey questionnaires into the corresponding numeric codes according to the standard code book from the Census Bureau. We employ a kNN (k-Nearest Neighbors)-based document classification method and information retrieval techniques to index and weight index terms. To address inconsistencies in how respondents describe the same industry or occupation, we use nouns and phrases acquired from past census data. Using these data, we can estimate the nouns or phrases frequently used to describe a given code. The experimental results show that past census data play an important role in increasing code classification accuracy.",
author = "Lim, {Heui Seok} and Hyeoncheol Kim",
year = "2004",
month = "12",
day = "1",
language = "English",
volume = "3316",
pages = "827--833",
journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
issn = "0302-9743",
publisher = "Springer Verlag",

}

TY - JOUR

T1 - Automated classification of industry and occupation codes using document classification method

AU - Lim, Heui Seok

AU - Kim, Hyeoncheol

PY - 2004/12/1

Y1 - 2004/12/1

N2 - This paper describes the development of an automated industry and occupation coding system for Korean Census records. The purpose of the system is to convert natural language responses on survey questionnaires into the corresponding numeric codes according to the standard code book from the Census Bureau. We employ a kNN (k-Nearest Neighbors)-based document classification method and information retrieval techniques to index and weight index terms. To address inconsistencies in how respondents describe the same industry or occupation, we use nouns and phrases acquired from past census data. Using these data, we can estimate the nouns or phrases frequently used to describe a given code. The experimental results show that past census data play an important role in increasing code classification accuracy.


UR - http://www.scopus.com/inward/record.url?scp=35048894925&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=35048894925&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:35048894925

VL - 3316

SP - 827

EP - 833

JO - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

JF - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SN - 0302-9743

ER -