ANTHRO: Perturbations in the Wild

Repository for the paper "Perturbations in the Wild: Leveraging Human-Written Text Perturbations for Realistic Adversarial Attack and Defense", Findings of ACL 2022 [pdf]

A demo version of ANTHRO was accepted to the ICDE 2023 Demo track. Part of the system is accessible via link

Instructions

Please see the test.py file for usage examples.

Extract Potential Perturbations on External Corpus

from anthro_lib import ANTHRO
anthro = ANTHRO()

Next, we can extract potential perturbations from a list of texts and save the results to local disk.

texts = ['democrats', 'demokrats', 'democRATs', 'republicans', 'repubLIEcans', 'republiCUNTs']
anthro.process(texts)
anthro.save('./saved')

Then, we can search for perturbations of an input token.

print(anthro.get_similars('republicans', level=1, distance=5, strict=False))
>> {'republicans', 'repubLIEcans', 'republiCUNTs'}

There are two modes for finding perturbations: strict (strict=True) and non-strict (strict=False). Non-strict mode returns more perturbations (some of them are pretty creative) but also more false positives. We use only strict=True in all experiments in our paper.
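To build intuition for what a lookup like get_similars does, here is a minimal, self-contained sketch. It is not the code in anthro_lib.py: the Soundex-like key, the helper names (toy_sound_key, build_index, toy_get_similars), and the reading of distance as a case-sensitive Levenshtein threshold are all assumptions made for illustration, and the sketch does not model the level or strict parameters.

```python
from collections import defaultdict

# Hypothetical Soundex-like consonant classes (an assumption, not ANTHRO's encoding).
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def toy_sound_key(token):
    """Lowercase the token, keep its first character, then append consonant
    codes, skipping a code that repeats the previously kept one."""
    lowered = token.lower()
    key = lowered[:1]
    last = ""
    for ch in lowered[1:]:
        code = _CODES.get(ch, "")  # vowels and other characters have no code
        if code and code != last:
            key += code
            last = code
    return key


def edit_distance(a, b):
    """Case-sensitive Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def build_index(tokens):
    """Group the original (cased) tokens under their toy sound key."""
    index = defaultdict(set)
    for tok in tokens:
        index[toy_sound_key(tok)].add(tok)
    return index


def toy_get_similars(index, token, distance):
    """Return tokens sharing the query's key within the given edit distance."""
    candidates = index.get(toy_sound_key(token), set())
    return {c for c in candidates if edit_distance(token, c) <= distance}


index = build_index(['democrats', 'demokrats', 'democRATs',
                     'republicans', 'repubLIEcans'])
print(sorted(toy_get_similars(index, 'republicans', distance=5)))
# ['repubLIEcans', 'republicans']
print(sorted(toy_get_similars(index, 'democrats', distance=1)))
# ['democrats', 'demokrats']
```

In this toy version, a smaller distance keeps only near-identical spellings, while a larger one admits more heavily altered variants that still share the same key.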

Load V1.0 ANTHRO Dictionary

We also provide a large dictionary of 2,083,037 unique cased tokens, covering a total of 407,620 unique sounds, found on social media (and assumed to be written by humans). This dictionary can be loaded easily. We encourage other researchers to use it as the ANTHRO baseline.

from anthro_lib import ANTHRO
anthro = ANTHRO()
anthro.load('./ANTHRO_Data_V1.0')

print(anthro.get_similars('presidents', level=1, distance=5, strict=True))
>> {'Presidents', 'PRESIDENTS', 'prsidents', 'PRESIDENTs', 'presidents'}

print(anthro.get_similars('biden', level=1, distance=5, strict=True))
>> {'BiDEN', 'bieden', 'biiden', 'bidennn', 'biyden', 'BIDENNN', 'biDen', 'BIDEENN', 'bidenn', 'bideN', 'Bidn', 'BideN', 'BIDEN', 'bIDen', 'Biddn', 'Biden', 'BIDENN', 'bideeen', 'BIdEn', 'biddden', 'Bideeen', 'biiiden', 'BIDDDEENNN', 'Bidien', 'bidenN', 'Biiiden', 'Bideeennn', 'BIDEEEN', 'BiDen', 'BIDEn', 'BIIIDEEEN', 'bidn', 'bideen', 'BIDEEENNN', 'biden', 'bIdEn', 'BIEDEN', 'BIIIDEN', 'Bidennn', 'BIden', 'Biddeeennn', 'BIIIDENNN', 'BiDeN', 'biddeeen', 'Bieden', 'biddn', 'bIDeN', 'Bideen'}

print(anthro.get_similars('trump', level=1, distance=5, strict=True))
>> {'tRuMp', 'TRMP', 'trUmP', 'truuump', 'TRUMPP', 'Trumpp', 'tRuMP', 'trUmp', 'trrump', 'trumpP', 'Trummmp', 'trumppp', 'TRUMMMPPP', 'TRUUUMP', 'TRuMp', 'Trummp', 'Truuump', 'TRUmp', 'Trmp', 'trump', 'TrUMp', 'trumP', 'TRUMP', 'TrUMP', 'Truuummp', 'TRUMPPP', 'trummp', 'Trump', 'TruMP', 'trUMp', 'tRump', 'TRUUUMMPP', 'tRUmp', 'tRUMp', 'TrUmp', 'tRUMP', 'truump', 'TrumP', 'TrUmP', 'TruMp', 'trumpp', 'trmp', 'Trrump', 'Trumppp', 'TRump', 'tRumP', 'tRUmP'}
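The result sets above mix case-only variants with genuine spelling variants. One possible way to separate the two when inspecting a result is to group tokens by their lowercased form. This is a small sketch, not part of anthro_lib.py; the helper name group_by_spelling is hypothetical, and the input is the 'presidents' result shown above.

```python
from collections import defaultdict


def group_by_spelling(variants):
    """Map each lowercased spelling to the set of cased tokens that share it."""
    groups = defaultdict(set)
    for token in variants:
        groups[token.lower()].add(token)
    return dict(groups)


# Result set copied from the 'presidents' example above.
presidents = {'Presidents', 'PRESIDENTS', 'prsidents', 'PRESIDENTs', 'presidents'}

groups = group_by_spelling(presidents)
for spelling in sorted(groups):
    print(spelling, sorted(groups[spelling]))
# presidents ['PRESIDENTS', 'PRESIDENTs', 'Presidents', 'presidents']
# prsidents ['prsidents']
```

Here four of the five tokens differ from the query only in casing, and one ('prsidents') is a distinct spelling.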

If you use the ANTHRO_Data_V1.0 dictionary as a baseline for comparison, please note that the results reported in our paper are based on a richer dictionary, which was extracted from a corpus that includes some private datasets and cannot be publicly released.

Please cite our paper with the following BibTeX:

@article{le2022perturbations,
  title={Perturbations in the Wild: Leveraging Human-Written Text Perturbations for Realistic Adversarial Attack and Defense},
  author={Le, Thai and Lee, Jooyoung and Yen, Kevin and Hu, Yifan and Lee, Dongwon},
  journal={60th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2022}
}
