TY - JOUR
T1 - Ancient Korean Neural Machine Translation
AU - Park, Chanjun
AU - Lee, Chanhee
AU - Yang, Yeongwook
AU - Lim, Heuiseok
N1 - Funding Information:
This work was supported in part by the Ministry of Science and ICT (MSIT), South Korea, through the Information Technology Research Center (ITRC) Support Program, supervised by the Institute for Information and Communications Technology Planning and Evaluation (IITP), under Grant IITP-2020-2018-0-01405, and in part by the National Research Foundation of Korea (NRF) funded by the Korea Government (MSIP) under Grant NRF-2017M3C4A7068189.
Publisher Copyright:
© 2013 IEEE.
PY - 2020
Y1 - 2020
N2 - Translation of the languages of ancient times can serve as source content for various digital media and can be helpful in fields such as natural phenomena, medicine, and science. Owing to these needs, there has been a global movement to translate ancient languages, but doing so requires trained experts. Such experts are difficult to train and, more importantly, manual translation is a slow process. Consequently, the recovery of ancient characters using machine translation has recently been investigated, but there is currently no literature on the machine translation of ancient Korean. This paper proposes the first ancient Korean neural machine translation model, based on the Transformer. The model can improve a translator's efficiency by quickly providing draft translations for the many untranslated ancient documents. Furthermore, a new subword tokenization method, called Share Vocabulary and Entity Restriction Byte Pair Encoding, is proposed based on the characteristics of ancient Korean sentences. The proposed method improves on conventional subword tokenization methods such as byte pair encoding by 5.25 BLEU points. In addition, decoding strategies such as n-gram blocking and model ensembling further improve performance by 2.89 BLEU points. The model has been made publicly available as a software application.
AB - Translation of the languages of ancient times can serve as source content for various digital media and can be helpful in fields such as natural phenomena, medicine, and science. Owing to these needs, there has been a global movement to translate ancient languages, but doing so requires trained experts. Such experts are difficult to train and, more importantly, manual translation is a slow process. Consequently, the recovery of ancient characters using machine translation has recently been investigated, but there is currently no literature on the machine translation of ancient Korean. This paper proposes the first ancient Korean neural machine translation model, based on the Transformer. The model can improve a translator's efficiency by quickly providing draft translations for the many untranslated ancient documents. Furthermore, a new subword tokenization method, called Share Vocabulary and Entity Restriction Byte Pair Encoding, is proposed based on the characteristics of ancient Korean sentences. The proposed method improves on conventional subword tokenization methods such as byte pair encoding by 5.25 BLEU points. In addition, decoding strategies such as n-gram blocking and model ensembling further improve performance by 2.89 BLEU points. The model has been made publicly available as a software application.
KW - Ancient Korean translation
KW - neural machine translation
KW - share vocabulary and entity restriction byte pair encoding
KW - subword tokenization
KW - transformer
UR - http://www.scopus.com/inward/record.url?scp=85087819146&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2020.3004879
DO - 10.1109/ACCESS.2020.3004879
M3 - Article
AN - SCOPUS:85087819146
VL - 8
SP - 116617
EP - 116625
JO - IEEE Access
JF - IEEE Access
SN - 2169-3536
M1 - 9125904
ER -