TY - GEN
T1 - A Dog Is Passing over the Jet? A Text-Generation Dataset for Korean Commonsense Reasoning and Evaluation
AU - Seo, Jaehyung
AU - Lee, Seounghoon
AU - Park, Chanjun
AU - Jang, Yoonna
AU - Moon, Hyeonseok
AU - Eo, Sugyeong
AU - Koo, Seonmin
AU - Lim, Heuiseok
N1 - Funding Information:
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-0-01405) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation). This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00368, A Neural-Symbolic Model for Knowledge Acquisition and Inference Techniques). This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021R1A6A1A03045425).
Publisher Copyright:
© Findings of the Association for Computational Linguistics: NAACL 2022 - Findings.
PY - 2022
Y1 - 2022
N2 - Recent natural language understanding (NLU) research on the Korean language has been maturing vigorously with the advancement of pretrained language models and datasets. However, Korean pretrained language models still struggle to generate a short sentence under a given condition based on compositionality and commonsense reasoning (i.e., generative commonsense reasoning). The two major challenges are inadequate data resources for developing generative commonsense reasoning that reflects Korean linguistic features and for evaluating language models, both of which are necessary for natural language generation (NLG). To address these problems, we propose a text-generation dataset for Korean generative commonsense reasoning and language model evaluation. In this work, a semi-automatic dataset construction approach filters out content that is inexplicable by commonsense, ensures quality, and reduces the cost of building the dataset. We also present an in-depth analysis of the generation results of language models with various evaluation metrics along with human-annotated scores. The whole dataset is publicly available at https://aihub.or.kr/opendata/korea-university.
AB - Recent natural language understanding (NLU) research on the Korean language has been maturing vigorously with the advancement of pretrained language models and datasets. However, Korean pretrained language models still struggle to generate a short sentence under a given condition based on compositionality and commonsense reasoning (i.e., generative commonsense reasoning). The two major challenges are inadequate data resources for developing generative commonsense reasoning that reflects Korean linguistic features and for evaluating language models, both of which are necessary for natural language generation (NLG). To address these problems, we propose a text-generation dataset for Korean generative commonsense reasoning and language model evaluation. In this work, a semi-automatic dataset construction approach filters out content that is inexplicable by commonsense, ensures quality, and reduces the cost of building the dataset. We also present an in-depth analysis of the generation results of language models with various evaluation metrics along with human-annotated scores. The whole dataset is publicly available at https://aihub.or.kr/opendata/korea-university.
UR - http://www.scopus.com/inward/record.url?scp=85137353354&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85137353354
T3 - Findings of the Association for Computational Linguistics: NAACL 2022 - Findings
SP - 2233
EP - 2249
BT - Findings of the Association for Computational Linguistics: NAACL 2022
PB - Association for Computational Linguistics (ACL)
T2 - 2022 Findings of the Association for Computational Linguistics: NAACL 2022
Y2 - 10 July 2022 through 15 July 2022
ER -