A reproduction of Apple's bi-directional LSTM models for language identification in short strings

Mads Toftrup, Søren Asger Sørensen, Manuel Ciosici, Ira Assent

Student Research Workshop, Long paper

Gather-2F: Apr 22 (13:00-15:00 UTC)

Abstract: Language Identification is the task of identifying a document's language. For applications like automatic spell checker selection, language identification must use very short strings such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model's performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes are due to confusion between related languages.

Similar Papers in EACL 2021

Searching for Search Errors in Neural Morphological Inflection
Martina Forster, Clara Meister, Ryan Cotterell
Language Models for Lexical Inference in Context
Martin Schmitt, Hinrich Schütze
Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers
Liina Repo, Valtteri Skantsi, Samuel Rönnqvist, Saara Hellström, Miika Oinonen, Anna Salmela, Douglas Biber, Jesse Egbert, Sampo Pyysalo, Veronika Laippala