An empirical study on web mining of parallel data

Gumwon Hong, Chi Ho Li, Ming Zhou, Hae-Chang Rim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Citations (Scopus)

Abstract

This paper presents an empirical approach to mining parallel corpora. Conventional approaches use a readily available collection of comparable, nonparallel corpora to extract parallel sentences. This paper attempts the much more challenging task of directly searching for high-quality sentence pairs from the Web. We tackle the problem by formulating good search queries using Learning to Rank and by filtering noisy document pairs using IBM Model 1 alignment. End-to-end evaluation shows that the proposed approach significantly improves the performance of statistical machine translation.
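As a rough illustration of the filtering idea mentioned in the abstract, the sketch below scores a candidate pair with a length-normalized IBM Model 1 log-probability and keeps it only if the score clears a threshold. This is not the authors' implementation: it works on a single sentence pair rather than a document pair, and the function names, the toy translation table, the probability floor, and the threshold value are all assumptions made for the example.

```python
import math

NULL = "<null>"  # conventional empty source word in IBM Model 1


def model1_score(src, tgt, t_table, floor=1e-9):
    """Length-normalized IBM Model 1 log-probability of tgt given src.

    t_table maps (target_word, source_word) to a lexical translation
    probability; here it is a hypothetical, hand-written table.
    """
    if not tgt:
        return float("-inf")
    src_with_null = [NULL] + list(src)
    total = 0.0
    for f in tgt:
        # Model 1 uses a uniform alignment prior, so each target word's
        # probability is the average of t(f | e) over all source positions.
        p = sum(t_table.get((f, e), 0.0) for e in src_with_null) / len(src_with_null)
        total += math.log(max(p, floor))  # floor avoids log(0) for unseen words
    return total / len(tgt)


def keep_pair(src, tgt, t_table, threshold=-5.0):
    """Keep a candidate pair only if its Model 1 score reaches the threshold."""
    return model1_score(src, tgt, t_table) >= threshold


# Toy translation table (assumed values, for illustration only).
t_table = {
    ("maison", "house"): 0.8,
    ("la", "the"): 0.7,
    ("la", NULL): 0.1,
}

good = (["the", "house"], ["la", "maison"])
bad = (["the", "house"], ["chien", "noir"])

print(keep_pair(good[0], good[1], t_table))
print(keep_pair(bad[0], bad[1], t_table))
```

In practice the translation table would be estimated from seed parallel data, and the threshold would be tuned on held-out pairs rather than fixed by hand.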

Original language: English
Title of host publication: Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference
Pages: 474-482
Number of pages: 9
Volume: 2
Publication status: Published - 2010 Dec 1
Event: 23rd International Conference on Computational Linguistics, Coling 2010 - Beijing, China
Duration: 2010 Aug 23 to 2010 Aug 27

Other

Other: 23rd International Conference on Computational Linguistics, Coling 2010
Country: China
City: Beijing
Period: 10/8/23 to 10/8/27

Fingerprint

evaluation
learning
performance
Web Mining
Empirical Study
Alignment
Conventional
Statistical Machine Translation
Parallel Corpora
Comparable Corpora

ASJC Scopus subject areas

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Linguistics and Language

Cite this

Hong, G., Li, C. H., Zhou, M., & Rim, H-C. (2010). An empirical study on web mining of parallel data. In Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference (Vol. 2, pp. 474-482)

@inproceedings{9b7dcaba372746cc8bd148bcf063122a,
title = "An empirical study on web mining of parallel data",
abstract = "This paper presents an empirical approach to mining parallel corpora. Conventional approaches use a readily available collection of comparable, nonparallel corpora to extract parallel sentences. This paper attempts the much more challenging task of directly searching for high-quality sentence pairs from the Web. We tackle the problem by formulating good search queries using Learning to Rank and by filtering noisy document pairs using IBM Model 1 alignment. End-to-end evaluation shows that the proposed approach significantly improves the performance of statistical machine translation.",
author = "Gumwon Hong and Li, {Chi Ho} and Ming Zhou and Hae-Chang Rim",
year = "2010",
month = "12",
day = "1",
language = "English",
volume = "2",
pages = "474--482",
booktitle = "Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference",

}

TY - GEN

T1 - An empirical study on web mining of parallel data

AU - Hong, Gumwon

AU - Li, Chi Ho

AU - Zhou, Ming

AU - Rim, Hae-Chang

PY - 2010/12/1

Y1 - 2010/12/1

AB - This paper presents an empirical approach to mining parallel corpora. Conventional approaches use a readily available collection of comparable, nonparallel corpora to extract parallel sentences. This paper attempts the much more challenging task of directly searching for high-quality sentence pairs from the Web. We tackle the problem by formulating good search queries using Learning to Rank and by filtering noisy document pairs using IBM Model 1 alignment. End-to-end evaluation shows that the proposed approach significantly improves the performance of statistical machine translation.

UR - http://www.scopus.com/inward/record.url?scp=80053393112&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053393112&partnerID=8YFLogxK

M3 - Conference contribution

VL - 2

SP - 474

EP - 482

BT - Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference

ER -