TY - JOUR
T1 - Decoding strategies for improving low-resource machine translation
AU - Park, Chanjun
AU - Yang, Yeongwook
AU - Park, Kinam
AU - Lim, Heuiseok
N1 - Funding Information:
This work was supported by the Institute for Information and Communications Technology Planning and Evaluation (IITP) grant funded by the Korea government, Ministry of Science and ICT (MSIT) (No. 2020-0-00368, A Neural-Symbolic Model for Knowledge Acquisition and Inference Techniques), and this research was supported by the MSIT, Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2018-0-01405) supervised by the IITP (Institute for Information and Communications Technology Planning and Evaluation). Acknowledgments: Thanks to AI Hub for creating a great dataset. I am very grateful to my friend Yeonsu Lee (Sungkyunkwan University) for helping me with English corrections.
Publisher Copyright:
© 2020 by the authors. Licensee MDPI, Basel, Switzerland.
PY - 2020/10
Y1 - 2020/10
N2 - Pre-processing and post-processing are significant aspects of natural language processing (NLP) application software. Pre-processing in neural machine translation (NMT) includes subword tokenization to alleviate the problem of unknown words, parallel corpus filtering that retains only data suitable for training, and data augmentation to ensure that the corpus contains sufficient content. Post-processing includes automatic post-editing and the application of various strategies during decoding in the translation process. Most recent NLP research is based on the Pretrain-Finetuning Approach (PFA). However, when small and medium-sized organizations with insufficient hardware attempt to provide NLP services, throughput and memory problems often occur. These difficulties increase when PFA is used to process low-resource languages, as PFA requires large amounts of data, and the data for low-resource languages are often insufficient. Building on the current research premise that NMT model performance can be enhanced through various pre-processing and post-processing strategies without changing the model, we applied various decoding strategies to Korean–English NMT, a low-resource language pair. Through comparative experiments, we demonstrated that translation performance could be enhanced without changes to the model. We experimentally examined how performance changed in response to beam size and n-gram blocking, and whether performance was enhanced when a length penalty was applied. The results showed that the various decoding strategies enhance performance and compare favorably with previous Korean–English NMT approaches. Therefore, the proposed methodology can improve the performance of NMT models without the use of PFA; this presents a new perspective for improving machine translation performance.
KW - Decoding strategies
KW - Efficiency processing
KW - Korean–English neural machine translation
KW - Neural machine translation
KW - Post-processing
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85091663903&partnerID=8YFLogxK
U2 - 10.3390/electronics9101562
DO - 10.3390/electronics9101562
M3 - Article
AN - SCOPUS:85091663903
SN - 2079-9292
VL - 9
SP - 1
EP - 15
JO - Electronics (Switzerland)
JF - Electronics (Switzerland)
IS - 10
M1 - 1562
ER -