Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19

Muhammad Abdul-Mageed, AbdelRahim Elmadany, El Moatez Billah Nagoudi, Dinesh Pabbi, Kunal Verma, Rannie Lin

Language Resources and Evaluation Long paper Paper

Gather-1A: Apr 21, Gather-1A: Apr 21 (13:00-15:00 UTC) [Join Gather Meeting]

You can open the pre-recorded video in separate windows.

Abstract: We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 268 countries), longitudinal (goes as back as 2007), multilingual (comes in 100+ languages), and has a significant number of location-tagged tweets (~169M tweets). We release tweet IDs from the dataset. We also develop two powerful models, one for identifying whether or not a tweet is related to the pandemic (best F1=97%) and another for detecting misinformation about COVID-19 (best F1=92%). A human annotation study reveals the utility of our models on a subset of Mega-COV. Our data and models can be useful for studying a wide host of phenomena related to the pandemic. Mega-COV and our models are publicly available.
NOTE: Video may display a random order of authors. Correct author list is at the top of this page.

Connected Papers in EACL2021

Similar Papers

Challenges in Automated Debiasing for Toxic Language Detection
Xuhui Zhou, Maarten Sap, Swabha Swayamdipta, Yejin Choi, Noah Smith,
CovRelex: A COVID-19 Retrieval System with Relation Extraction
Vu Tran, Van-Hien Tran, Phuong Nguyen, Chau Nguyen, Ken Satoh, Yuji Matsumoto, Minh Nguyen,