
Detection of Cyberbullying (Natural Language Processing) 🤬🗣️


Business Application in Social Media

Cyberbullying is a serious issue, especially in the age of social media, where interactions can turn hurtful. Some applications that this cyberbullying-detection code can support:

- Real-time Monitoring: Continuously scan social media posts or messages for signs of cyberbullying.
- Alerts and Notifications: Notify users when potentially harmful content is detected.
- Reporting Mechanism: Allow users to report incidents and take appropriate action.

Which pre-trained model was chosen and why ❓

DistilBERT-base-uncased was chosen over other transformer models.

- For sentiment analysis, bidirectional semantic comprehension is important, so GPT (which is better suited to text generation) was not chosen

- It uses the same loss calculation and Masked Language Modeling (MLM) objective as BERT, but is lighter and faster

- As part of text cleaning, the text is lowercased, so the uncased model variant is used since capitalization carries no useful signal here

- I have limited/irregular GPU availability and a small (~2,000-tweet) corpus, making DistilBERT the most appropriate choice (a loading sketch follows)
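
A minimal sketch of loading this checkpoint with the Hugging Face transformers library; the binary `num_labels=2` head is an assumption based on the toxic/non-toxic target:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained DistilBERT checkpoint; num_labels=2 assumes a binary
# toxic / not-toxic target, matching this project's setup.
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```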


Coding Blocks 👩‍💻👩‍💻 💬

EDA and Feature Engineering: remove unwanted columns, treat missing values, and ensure the right datatypes for each column

Note: It is important to have integer target labels (not float); otherwise BCEWithLogitsLoss throws an error during model training
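
A minimal sketch of this step, assuming hypothetical file and column names (`tweets.csv`, `tweet_text`, `label`) for illustration:

```python
import pandas as pd

# Hypothetical file and column names, used for illustration only.
df = pd.read_csv("tweets.csv")

# Keep only the columns needed and drop rows with missing text or label.
df = df[["tweet_text", "label"]].dropna()

# Cast the target to integers; float labels can break the loss computation
# during training, as noted above.
df["label"] = df["label"].astype(int)
```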

Data Processing using Regex and NLTK

- Convert to lower case (not strictly necessary for the uncased BERT model)

- Remove all hashtags (#), handles (@), hyperlinks (http) and URLs (www.)

- Remove all characters other than letters and digits (emoticons, punctuation, and multi-space blocks)

- Identify commonly occurring irrelevant words, append them to the stopword list, and lemmatize

- Remove the duplicates

- Analyze the length of each sequence; this is useful when padding or truncating during tokenization (a cleaning sketch follows this list)
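
A minimal cleaning sketch along these lines; the extra stopwords (`rt`, `amp`) are illustrative assumptions, not the exact list used in the notebook:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
# Hypothetical project-specific additions to the stopword list.
stop_words.update({"rt", "amp"})
lemmatizer = WordNetLemmatizer()

def clean_tweet(text: str) -> str:
    text = text.lower()                            # lower-case
    text = re.sub(r"#\w+|@\w+", " ", text)         # hashtags and handles
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # hyperlinks and URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)       # keep only letters and digits
    text = re.sub(r"\s+", " ", text).strip()       # collapse multi-space blocks
    tokens = [lemmatizer.lemmatize(t) for t in text.split() if t not in stop_words]
    return " ".join(tokens)
```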


Defining the Transformer Dataset and Training

- Convert the cleaned dataframe to a Hugging Face Dataset to leverage its fast computation and batch processing

- Use the AutoTokenizer associated with DistilBERT with the longest-padding and truncation strategy

- Split into train and test datasets

- Define the DistilBERT sequence-classification model

- Leverage the Trainer class for faster training, initialize its arguments, and define evaluation functions (a training sketch follows this list)
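
A minimal end-to-end sketch of this stage, reusing the `df`, `tokenizer`, and `model` objects from the earlier snippets; the column name, output directory, and split ratio are assumptions:

```python
import numpy as np
import evaluate
from datasets import Dataset
from transformers import TrainingArguments, Trainer

# Convert the cleaned dataframe into a Hugging Face Dataset and split it.
dataset = Dataset.from_pandas(df[["tweet_text", "label"]])
dataset = dataset.train_test_split(test_size=0.2)

def tokenize(batch):
    # "longest" pads each batch to its longest sequence; truncation caps long tweets.
    return tokenizer(batch["tweet_text"], padding="longest", truncation=True)

tokenized = dataset.map(tokenize, batched=True)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)

args = TrainingArguments(output_dir="toxicity-distilbert", num_train_epochs=10)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```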


Results for epochs = 10

[Training and evaluation result screenshots]

Deployment

https://huggingface.co/LalasaMynalli/LalasaMynalli_First_LLM/resolve/main/README.md
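
A minimal sketch of loading the published model from the Hub; the repo id is taken from the link above, and treating it as a text-classification pipeline is an assumption about how the checkpoint was exported:

```python
from transformers import pipeline

# Repo id taken from the deployment link above; the pipeline task is assumed.
classifier = pipeline(
    "text-classification",
    model="LalasaMynalli/LalasaMynalli_First_LLM",
)
print(classifier("example tweet text goes here"))
```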

Next Steps

- Hyperparameter tuning
