Building Representative Corpora from Illiterate Communities: A Review of Challenges and Mitigation Strategies for Developing Countries

Stephanie Hirmer, Alycia Leonard, Josephine Tumwesige, Costanza Conforti

Language Resources and Evaluation, Long Paper

Zoom-3C: Apr 22 (07:00-08:00 UTC)
Gather-3D: Apr 23 (13:00-15:00 UTC)


Abstract: Most well-established data collection methods currently adopted in NLP depend on the assumption of speaker literacy. Consequently, the collected corpora largely fail to represent swathes of the global population, which tend to be some of the most vulnerable and marginalised people in society, and often live in rural developing areas. Such underrepresented groups are thus not only ignored when making modeling and system design decisions, but also prevented from benefiting from development outcomes achieved through data-driven NLP. This paper aims to address the under-representation of illiterate communities in NLP corpora: we identify potential biases and ethical issues that might arise when collecting data from rural communities with high illiteracy rates in Low-Income Countries, and propose a set of practical mitigation strategies to help future work.

Similar Papers in EACL 2021

Us vs. Them: A Dataset of Populist Attitudes, News Bias and Emotions
Pere-Lluís Huguet Cabot, David Abadi, Agneta Fischer, Ekaterina Shutova
MultiHumES: Multilingual Humanitarian Dataset for Extractive Summarization
Jenny Paola Yela-Bello, Ewan Oglethorpe, Navid Rekabsaz
Challenges in Automated Debiasing for Toxic Language Detection
Xuhui Zhou, Maarten Sap, Swabha Swayamdipta, Yejin Choi, Noah Smith
Probing for idiomaticity in vector space models
Marcos Garcia, Tiago Kramer Vieira, Carolina Scarton, Marco Idiart, Aline Villavicencio