Handling Out-Of-Vocabulary Problem in Hangeul Word Embeddings

Ohjoon Kwon, Dohyun Kim, Soo-Ryeon Lee, Junyoung Choi, SangKeun Lee

Semantics: Lexical Semantics (Long Paper)

Gather-1E: Apr 21 (13:00-15:00 UTC)


Abstract: Word embeddings are considered an essential factor in improving the performance of various Natural Language Processing (NLP) models. However, they are hardly applicable to real-world datasets, as word embeddings are generally studied on well-refined corpora. Notably, Hangeul (the Korean writing system) has a unique structure, and various kinds of Out-Of-Vocabulary (OOV) words arise in it from typos. In this paper, we propose a Hangeul word embedding model that is robust to typos while maintaining high performance. The proposed model utilizes a Convolutional Neural Network (CNN) architecture with a channel attention mechanism that learns to infer the original word embeddings. The model is trained on a dataset consisting of a mix of typos and correct words. To demonstrate the effectiveness of the proposed model, we conduct three kinds of intrinsic and extrinsic tasks. While existing embedding models fail to maintain stable performance as the noise level increases, the proposed model remains stable.
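To make the abstract's architecture concrete, the following is a minimal PyTorch sketch of the general idea: a character-level CNN with a channel attention gate that maps a (possibly misspelled) word to the embedding of its correct form. All class names, dimensions, the jamo vocabulary size, and the squeeze-and-excitation form of the attention are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: CNN over jamo (Hangeul letter) sequences with channel attention,
# trained to regress noisy inputs onto pretrained embeddings of correct words.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style gate over CNN feature channels (assumed form)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):             # x: (batch, channels, length)
        squeezed = x.mean(dim=2)      # global average pool over the sequence
        gate = self.fc(squeezed)      # per-channel importance weights in (0, 1)
        return x * gate.unsqueeze(2)  # re-weight each feature channel

class TypoRobustEmbedder(nn.Module):
    """CNN over jamo id sequences that predicts a word embedding (hypothetical)."""
    def __init__(self, n_jamo=70, char_dim=64, channels=128, emb_dim=300):
        super().__init__()
        self.char_emb = nn.Embedding(n_jamo, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, channels, kernel_size=3, padding=1)
        self.attention = ChannelAttention(channels)
        self.proj = nn.Linear(channels, emb_dim)

    def forward(self, jamo_ids):                     # (batch, seq_len)
        x = self.char_emb(jamo_ids).transpose(1, 2)  # (batch, char_dim, seq_len)
        x = F.relu(self.conv(x))
        x = self.attention(x)
        x = x.max(dim=2).values                      # max-pool over positions
        return self.proj(x)                          # predicted word embedding

# Training sketch: a noisy (typo) input should reproduce the pretrained
# embedding of its correct word; batches mix typos with clean words.
model = TypoRobustEmbedder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
noisy_ids = torch.randint(1, 70, (32, 12))  # stand-in batch of jamo id sequences
target = torch.randn(32, 300)               # stand-in pretrained target embeddings
optimizer.zero_grad()
loss = F.mse_loss(model(noisy_ids), target) # regress onto the original embedding
loss.backward()
optimizer.step()
```

The regression objective here is one plausible reading of "learns to infer the original word embeddings"; the paper itself should be consulted for the actual loss and attention design.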


Similar Papers

Cross-lingual Contextualized Topic Models with Zero-shot Learning
Federico Bianchi, Silvia Terragni, Dirk Hovy, Debora Nozza, Elisabetta Fersini
Subword Pooling Makes a Difference
Judit Ács, Ákos Kádár, András Kornai