GitHub - marcelomf/HateBR: HateBR is the first large-scale expert annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media.

HateBR - Offensive Language and Hate Speech Dataset in Brazilian Portuguese

HateBR is the first large-scale expert-annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media. The corpus was collected from Instagram comments posted on the accounts of Brazilian politicians and manually annotated by specialists. It comprises 7,000 documents annotated in three layers: a binary classification (offensive versus non-offensive comments), an offensiveness level (highly, moderately, and slightly offensive messages), and nine hate speech groups (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). Each comment was annotated by three different annotators, and the annotation achieved high inter-annotator agreement. Baseline experiments reached an F1-score of 85%, outperforming the current literature models for the Portuguese language. We hope that this expert-annotated corpus will foster research on hate speech and offensive language detection in Natural Language Processing.

This repository contains the corpus and the best models presented in the paper (see the "CITING" section). The HateBr.csv file contains the corpus annotated for offensive language and hate speech. It provides four columns, described below:

1st column: Instagram comments.
2nd column: Offensive language classification divided into offensive comments versus non-offensive comments.
3rd column: Offensiveness-level classification divided into highly offensive, moderately offensive, and slightly offensive.
4th column: Hate speech classification divided into nine different hate groups: antisemitism, apology for the dictatorship, fatphobia, homophobia, partyism, racism, religious intolerance, sexism, and xenophobia. Offensive comments that contain no hate speech are also labeled in this column.
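
As a minimal sketch of reading this four-column layout, the snippet below uses Python's standard csv module and addresses the columns by position, since the header names in HateBr.csv are not stated here. The function name `count_labels`, the assumption that the file starts with a header row, and the small in-memory sample are hypothetical and serve only as illustration.

```python
import csv
import io
from collections import Counter


def count_labels(handle, has_header=True):
    """Count label values per annotation layer from a 4-column CSV.

    Columns are read by position (comment, offensive, level, hate group),
    because the real header names are not documented here.
    """
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row (assumed to exist)
    counts = {"offensive": Counter(), "level": Counter(), "hate": Counter()}
    for row in reader:
        if len(row) < 4:
            continue  # ignore malformed or empty rows
        _comment, offensive, level, hate = row[:4]
        counts["offensive"][offensive.strip()] += 1
        counts["level"][level.strip()] += 1
        counts["hate"][hate.strip()] += 1
    return counts


# Hypothetical sample with placeholder text, not real corpus rows.
sample = io.StringIO(
    "comment,offensive,level,hate\n"
    "placeholder comment a,0,0,0\n"
    "placeholder comment b,1,2,-1\n"
    "placeholder comment c,1,3,5\n"
)
counts = count_labels(sample)
print(counts["offensive"])
```

For the real file, the same function could be called on a handle such as `open("HateBr.csv", newline="", encoding="utf-8")`. The path and the encoding are assumptions and may need adjusting.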

The following tables describe in detail the labels for each proposed layer of annotation:

Offensive Language

class	label	total
offensive	1	3,500
non-offensive	0	3,500
Total		7,000

Offensiveness Levels

class	label	total
highly	3	778
moderately	2	1,044
slightly	1	1,678
non-offensive	0	3,500
Total		7,000

Hate Speech

class	label	total
antisemitism	1	2
apology for the dictatorship	2	32
fatphobia	3	27
homophobia	4	17
partyism	5	496
racism	6	8
religious intolerance	7	47
sexism	8	97
xenophobia	9	1
offensive & non-hate speech	-1	2,773
non-offensive	0	3,500
Total		7,000
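
The label tables above can be written down as lookup dictionaries and checked for consistency. The sketch below transcribes the labels and counts from the tables. The dictionary and function names are hypothetical, chosen for illustration.

```python
# Label -> (class name, count), transcribed from the tables above.
OFFENSIVE = {1: ("offensive", 3500), 0: ("non-offensive", 3500)}

LEVELS = {
    3: ("highly", 778),
    2: ("moderately", 1044),
    1: ("slightly", 1678),
    0: ("non-offensive", 3500),
}

HATE = {
    1: ("antisemitism", 2),
    2: ("apology for the dictatorship", 32),
    3: ("fatphobia", 27),
    4: ("homophobia", 17),
    5: ("partyism", 496),
    6: ("racism", 8),
    7: ("religious intolerance", 47),
    8: ("sexism", 97),
    9: ("xenophobia", 1),
    -1: ("offensive & non-hate speech", 2773),
    0: ("non-offensive", 3500),
}


def total(layer):
    """Sum the document counts of every class in a layer."""
    return sum(count for _name, count in layer.values())


def offensive_total(layer):
    """Sum the counts of all classes except label 0 (non-offensive)."""
    return sum(count for label, (_name, count) in layer.items() if label != 0)


# Documents assigned to one of the nine hate speech groups (labels 1 to 9).
hate_only = sum(count for label, (_name, count) in HATE.items() if label > 0)

print(total(OFFENSIVE), total(LEVELS), total(HATE))
print(offensive_total(LEVELS), offensive_total(HATE), hate_only)
```

Each layer sums to 7,000 documents, and the non-zero labels of the offensiveness and hate speech layers each sum to the 3,500 offensive comments. By the same counts, 727 of those offensive comments carry one of the nine hate speech group labels.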

In addition, we provide baseline machine learning results for both tasks: offensive language detection and hate speech detection. The best models obtained are available here as .pkl files. File names follow the pattern [classification (offensive or hate)_representation (ngram or tfidf)_algorithm (nb, svm, mlp or lr)]. For example, the file offensive_tfidf_svm.pkl contains the offensive language detection model that uses the tf-idf representation and the support vector machine algorithm.
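
The naming pattern can be decoded mechanically. The following sketch splits a model file name into its three fields and rejects names that do not fit the pattern. The function name `parse_model_name` and the returned dictionary keys are hypothetical, and the allowed values are taken only from the description above.

```python
from pathlib import Path

# Allowed values for each field, as listed in the naming pattern.
TASKS = {"offensive", "hate"}
REPRESENTATIONS = {"ngram", "tfidf"}
ALGORITHMS = {"nb", "svm", "mlp", "lr"}


def parse_model_name(filename):
    """Split '<task>_<representation>_<algorithm>.pkl' into its parts."""
    path = Path(filename)
    if path.suffix != ".pkl":
        raise ValueError(f"expected a .pkl file, got {filename!r}")
    parts = path.stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected three '_'-separated fields in {filename!r}")
    task, representation, algorithm = parts
    if (
        task not in TASKS
        or representation not in REPRESENTATIONS
        or algorithm not in ALGORITHMS
    ):
        raise ValueError(f"unrecognized field in {filename!r}")
    return {
        "task": task,
        "representation": representation,
        "algorithm": algorithm,
    }


info = parse_model_name("offensive_tfidf_svm.pkl")
print(info)
```

Loading a model would typically go through Python's `pickle.load` on the opened file. Which object each file contains and which library version it requires are not stated here, so treat that step as an assumption to verify, and only unpickle files from a source you trust.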

CITING

Vargas, F., Carvalho, I., Góes, F. R., Pardo, T. A. S., Benevenuto, F. (2022). HateBR: large expert annotated corpus of Brazilian Instagram comments for offensive language and hate speech detection. Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2022), pp. 1-10. Marseille, France. Association for Computational Linguistics (ACL).

ACKNOWLEDGEMENTS

The authors are grateful to the Social Computing Laboratory at the Computer Science Department of the Federal University of Minas Gerais for supporting this work. This work is partially funded by CNPq, Fapemig and Fapesp.

