TY - GEN
T1 - Toward robust classification using the Open Directory Project
AU - Ha, Jongwoo
AU - Lee, Jung Hyun
AU - Jang, Won Jun
AU - Lee, Yong Ku
AU - Lee, Sang-Geun
PY - 2014/3/10
Y1 - 2014/3/10
N2 - The Open Directory Project (ODP) is a large-scale, high-quality, publicly available web directory used in many studies and real-world applications. In this paper, we explore training data expansion techniques for text classification as one possible way to deal with the sparsity of the ODP dataset. We propose a dozen classification methods, which differ in (1) the categories from which training data is expanded and (2) how the expanded training data is merged to generate centroid vectors. Evaluation results show that training data expansion yields significantly better classification performance than representative classifiers. We also find that (1) child and descendant categories are more valuable sources of expanded training data than parent and ancestor categories, and (2) distance-based weighting is superior to simple averaging for merging the expanded training data.
AB - The Open Directory Project (ODP) is a large-scale, high-quality, publicly available web directory used in many studies and real-world applications. In this paper, we explore training data expansion techniques for text classification as one possible way to deal with the sparsity of the ODP dataset. We propose a dozen classification methods, which differ in (1) the categories from which training data is expanded and (2) how the expanded training data is merged to generate centroid vectors. Evaluation results show that training data expansion yields significantly better classification performance than representative classifiers. We also find that (1) child and descendant categories are more valuable sources of expanded training data than parent and ancestor categories, and (2) distance-based weighting is superior to simple averaging for merging the expanded training data.
UR - http://www.scopus.com/inward/record.url?scp=84937766992&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84937766992&partnerID=8YFLogxK
U2 - 10.1109/DSAA.2014.7058134
DO - 10.1109/DSAA.2014.7058134
M3 - Conference contribution
AN - SCOPUS:84937766992
T3 - DSAA 2014 - Proceedings of the 2014 IEEE International Conference on Data Science and Advanced Analytics
SP - 607
EP - 612
BT - DSAA 2014 - Proceedings of the 2014 IEEE International Conference on Data Science and Advanced Analytics
A2 - Karypis, George
A2 - Cao, Longbing
A2 - Wang, Wei
A2 - King, Irwin
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2014 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2014
Y2 - 30 October 2014 through 1 November 2014
ER -