Frequency-Guided Word Substitutions for Detecting Textual Adversarial Examples

Maximilian Mozes, Pontus Stenetorp, Bennett Kleinberg, Lewis Griffin

Interpretability and Analysis of Models for NLP (Short Paper)

Gather-3C: Apr 23 (13:00-15:00 UTC)


Abstract: Recent efforts have shown that neural text processing models are vulnerable to adversarial examples, but the nature of these examples is poorly understood. In this work, we show that adversarial attacks against CNN, LSTM and Transformer-based classification models perform word substitutions that are identifiable through frequency differences between replaced words and their corresponding substitutions. Based on these findings, we propose frequency-guided word substitutions (FGWS), a simple algorithm exploiting the frequency properties of adversarial word substitutions for the detection of adversarial examples. FGWS achieves strong performance by accurately detecting adversarial examples on the SST-2 and IMDb sentiment datasets, with F1 detection scores of up to 91.4% against RoBERTa-based classification models. We compare our approach against a recently proposed perturbation discrimination framework and show that we outperform it by up to 13.0% F1.
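The detection step described in the abstract can be summarized in a few lines. The sketch below is illustrative rather than the authors' released implementation: `word_freq`, `get_synonyms`, `model_predict`, and the thresholds `delta` and `gamma` are hypothetical stand-ins for a training-corpus frequency table, a synonym source (e.g., WordNet synonyms or embedding neighbours), a trained classifier, and tuned detection thresholds.

```python
# Minimal FGWS-style detection sketch (assumptions, not the paper's code):
#   word_freq:      dict mapping a word to its training-corpus frequency
#   get_synonyms:   returns candidate substitutes for a word
#   model_predict:  returns a list of class probabilities for a token list

def fgws_detect(tokens, model_predict, word_freq, get_synonyms,
                delta=0, gamma=0.5):
    """Flag `tokens` as adversarial if replacing low-frequency words
    with more frequent synonyms changes the model's confidence in its
    original prediction by more than `gamma`."""
    probs = model_predict(tokens)
    pred_class = max(range(len(probs)), key=probs.__getitem__)

    transformed = []
    for w in tokens:
        if word_freq.get(w, 0) <= delta:
            # Keep only substitutes that occur more often than w itself.
            candidates = [s for s in get_synonyms(w)
                          if word_freq.get(s, 0) > word_freq.get(w, 0)]
            if candidates:
                # Substitute the highest-frequency candidate.
                w = max(candidates, key=lambda s: word_freq.get(s, 0))
        transformed.append(w)

    new_probs = model_predict(transformed)
    confidence_drop = probs[pred_class] - new_probs[pred_class]
    return confidence_drop > gamma, transformed
```

In practice the frequency threshold `delta` and the confidence-change threshold `gamma` would be tuned on held-out data so that few unperturbed inputs are flagged as adversarial.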


Similar Papers

On Robustness of Neural Semantic Parsers
Shuo Huang, Zhuang Li, Lizhen Qu, Lei Pan
Adv-OLM: Generating Textual Adversaries via OLM
Vijit Malik, Ashwani Bhat, Ashutosh Modi
On-Device Text Representations Robust To Misspellings via Projections
Chinnadhurai Sankar, Sujith Ravi, Zornitsa Kozareva