Toward robust classification using the Open Directory Project

Jongwoo Ha, Jung Hyun Lee, Won Jun Jang, Yong Ku Lee, Sang-Geun Lee

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

9 Citations (Scopus)

Abstract

The Open Directory Project (ODP) is a large-scale, high-quality, publicly available web directory used in many studies and real-world applications. In this paper, we explore training data expansion techniques for text classification as one possible way to deal with the sparsity of the ODP dataset. We propose a dozen classification methods, which differ in (1) the categories from which training data is expanded and (2) how the expanded training data is merged to generate centroid vectors. Evaluation results show that training data expansion significantly improves classification performance, outperforming representative classifiers. We also find that (1) child and descendant categories are more valuable sources for expanding training data than parent and ancestor categories, and (2) distance-based weighting is superior to simple averaging for merging the expanded training data.
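
The two design axes named in the abstract (which related categories supply extra training data, and how their vectors are merged into a centroid) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration only: the abstract does not give the paper's weighting formula, so the `1 / (1 + distance)` weight, the term-frequency vectors, the cosine similarity, and all function names are hypothetical and not the authors' implementation.

```python
from collections import Counter
import math


def centroid(docs):
    """Average term-frequency vector over a list of tokenized documents."""
    total = Counter()
    for doc in docs:
        total.update(doc)
    n = max(len(docs), 1)
    return {term: count / n for term, count in total.items()}


def merge_centroids(own, expanded, weighting="distance"):
    """Merge a category's own centroid with centroids of related categories.

    expanded:  list of (centroid, distance) pairs, where distance is the
               number of hierarchy hops to the related category (>= 1).
    weighting: "average"  -> every centroid gets weight 1.
               "distance" -> weight 1 / (1 + distance); this formula is an
                             assumption, not taken from the paper.
    """
    weighted = [(own, 1.0)]
    for vec, dist in expanded:
        w = 1.0 if weighting == "average" else 1.0 / (1.0 + dist)
        weighted.append((vec, w))
    total_w = sum(w for _, w in weighted)
    merged = Counter()
    for vec, w in weighted:
        for term, value in vec.items():
            merged[term] += w * value / total_w
    return dict(merged)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(doc, centroids):
    """Return the category whose merged centroid is most similar to doc."""
    query = centroid([doc])
    return max(centroids, key=lambda cat: cosine(query, centroids[cat]))
```

With simple averaging, a child category's centroid counts as much as the category's own data. With the assumed distance-based weights, a child one hop away contributes half as much, and more distant descendants contribute progressively less.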

Original language: English
Title of host publication: DSAA 2014 - Proceedings of the 2014 IEEE International Conference on Data Science and Advanced Analytics
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 607-612
Number of pages: 6
ISBN (Print): 9781479969913
DOIs: https://doi.org/10.1109/DSAA.2014.7058134
Publication status: Published - 2014 Mar 10
Event: 2014 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2014 - Shanghai, China
Duration: 2014 Oct 30 to 2014 Nov 1

Other

Other: 2014 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2014
Country: China
City: Shanghai
Period: 14/10/30 to 14/11/1

Fingerprint

Classifiers
World Wide Web
Classifier
Weighting
Text classification
Evaluation

ASJC Scopus subject areas

  • Artificial Intelligence
  • Information Systems
  • Information Systems and Management

Cite this

Ha, J., Lee, J. H., Jang, W. J., Lee, Y. K., & Lee, S-G. (2014). Toward robust classification using the Open Directory Project. In DSAA 2014 - Proceedings of the 2014 IEEE International Conference on Data Science and Advanced Analytics (pp. 607-612). [7058134] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/DSAA.2014.7058134

Toward robust classification using the Open Directory Project. / Ha, Jongwoo; Lee, Jung Hyun; Jang, Won Jun; Lee, Yong Ku; Lee, Sang-Geun.

DSAA 2014 - Proceedings of the 2014 IEEE International Conference on Data Science and Advanced Analytics. Institute of Electrical and Electronics Engineers Inc., 2014. p. 607-612 7058134.

Ha, J, Lee, JH, Jang, WJ, Lee, YK & Lee, S-G 2014, Toward robust classification using the Open Directory Project. in DSAA 2014 - Proceedings of the 2014 IEEE International Conference on Data Science and Advanced Analytics, 7058134, Institute of Electrical and Electronics Engineers Inc., pp. 607-612, 2014 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2014, Shanghai, China, 14/10/30. https://doi.org/10.1109/DSAA.2014.7058134
Ha J, Lee JH, Jang WJ, Lee YK, Lee S-G. Toward robust classification using the Open Directory Project. In DSAA 2014 - Proceedings of the 2014 IEEE International Conference on Data Science and Advanced Analytics. Institute of Electrical and Electronics Engineers Inc. 2014. p. 607-612. 7058134 https://doi.org/10.1109/DSAA.2014.7058134
Ha, Jongwoo ; Lee, Jung Hyun ; Jang, Won Jun ; Lee, Yong Ku ; Lee, Sang-Geun. / Toward robust classification using the Open Directory Project. DSAA 2014 - Proceedings of the 2014 IEEE International Conference on Data Science and Advanced Analytics. Institute of Electrical and Electronics Engineers Inc., 2014. pp. 607-612
@inproceedings{cfe1f5b3477141f2942e37cc8817ec92,
title = "Toward robust classification using the Open Directory Project",
abstract = "The Open Directory Project (ODP) is a large scale, high quality and publicly available web directory utilized in many studies and real-world applications. In this paper, we explore training data expansion techniques for text classification as one of the possible directions to deal with the sparse characteristic of the ODP dataset. We propose a dozen classification methods, which can be differentiated by (1) from which categories training data is expanded, and (2) how the expanded training data is merged to generate centroid vectors. Evaluation results show that training data expansion significantly improves the classification performance more than representative classifiers. We also find that (1) child and descendant categories are more valuable sources to expand training data than parent and ancestor categories, and (2) distance-based weighting is superior to simple averaging to merge the expanded training data.",
author = "Ha, Jongwoo and Lee, Jung Hyun and Jang, Won Jun and Lee, Yong Ku and Lee, Sang-Geun",
year = "2014",
month = "3",
day = "10",
doi = "10.1109/DSAA.2014.7058134",
language = "English",
isbn = "9781479969913",
pages = "607--612",
booktitle = "DSAA 2014 - Proceedings of the 2014 IEEE International Conference on Data Science and Advanced Analytics",
publisher = "Institute of Electrical and Electronics Engineers Inc."
}

TY - GEN

T1 - Toward robust classification using the Open Directory Project

AU - Ha, Jongwoo

AU - Lee, Jung Hyun

AU - Jang, Won Jun

AU - Lee, Yong Ku

AU - Lee, Sang-Geun

PY - 2014/3/10

Y1 - 2014/3/10

AB - The Open Directory Project (ODP) is a large scale, high quality and publicly available web directory utilized in many studies and real-world applications. In this paper, we explore training data expansion techniques for text classification as one of the possible directions to deal with the sparse characteristic of the ODP dataset. We propose a dozen classification methods, which can be differentiated by (1) from which categories training data is expanded, and (2) how the expanded training data is merged to generate centroid vectors. Evaluation results show that training data expansion significantly improves the classification performance more than representative classifiers. We also find that (1) child and descendant categories are more valuable sources to expand training data than parent and ancestor categories, and (2) distance-based weighting is superior to simple averaging to merge the expanded training data.

UR - http://www.scopus.com/inward/record.url?scp=84937766992&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84937766992&partnerID=8YFLogxK

U2 - 10.1109/DSAA.2014.7058134

DO - 10.1109/DSAA.2014.7058134

M3 - Conference contribution

AN - SCOPUS:84937766992

SN - 9781479969913

SP - 607

EP - 612

BT - DSAA 2014 - Proceedings of the 2014 IEEE International Conference on Data Science and Advanced Analytics

PB - Institute of Electrical and Electronics Engineers Inc.

ER -