improving-short-text-classification

A repository to store code and computational results for the paper "Improving Short Text Classification With Augmented Data Using GPT-3" by Salvador Balkus and Donghui Yan.

Data

The research_data directory contains the raw question file collected from the University of Massachusetts Dartmouth Big Data Club Discord Server, as well as tables of questions annotated by each reviewer. Files with names followed by (1), (2), and (3) are the training, validation, and test sets used for training the original model, while those followed by reviewer# are used only to measure inter-annotator agreement. Names are redacted for privacy.
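As an illustration of how inter-annotator agreement between two reviewers can be quantified, here is a minimal sketch of Cohen's kappa. This README does not state which agreement statistic the paper uses, so the choice of kappa, the function name cohen_kappa, and the toy labels are all assumptions made for the example.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items.

    Hypothetical helper for illustration; not part of this repository.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (made up): is each message a question or not?
reviewer_1 = ["question", "question", "other", "other"]
reviewer_2 = ["question", "other", "other", "other"]
kappa = cohen_kappa(reviewer_1, reviewer_2)
```

A value near 1 indicates strong agreement, while a value near 0 indicates agreement no better than chance.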

Code

icego (which stands for In-Context Example Genetic Optimizer) contains code implementing the genetic algorithm as a Python module. It includes the functions to set up the genetic algorithm, evaluate the accuracy of candidates, iterate through a specified number of generations, and so on. The main model is implemented in incontext_optimizer.py, which draws on classes defined in population.py and candidate.py to maintain a population of candidates. cost_optimizer.py contains optional helper functions.
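The general shape of such a genetic algorithm can be sketched as follows. This is not the icego implementation: the Candidate class, the evolve function, its parameters, and the toy scoring function are hypothetical stand-ins. In the real module, scoring a candidate would presumably involve a model call rather than a local function.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    """Hypothetical candidate: a fixed-size list of in-context examples."""
    examples: List[str]
    fitness: float = 0.0


def evolve(pool: Sequence[str], score: Callable[[List[str]], float],
           pop_size: int = 8, n_examples: int = 3, generations: int = 5,
           mutation_rate: float = 0.2, seed: int = 0) -> Candidate:
    """Generic genetic search over example sets (a sketch, not icego's API)."""
    rng = random.Random(seed)
    pool = list(pool)
    # Initial population: random draws from the pool of available examples.
    population = [Candidate(rng.sample(pool, n_examples)) for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluate every candidate, then keep the better-scoring half.
        for cand in population:
            cand.fitness = score(cand.examples)
        population.sort(key=lambda c: c.fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            # Uniform crossover: take each slot from one parent at random.
            child = [rng.choice(pair)
                     for pair in zip(parent_a.examples, parent_b.examples)]
            # Mutation: occasionally swap a slot for a random pool example.
            child = [rng.choice(pool) if rng.random() < mutation_rate else ex
                     for ex in child]
            children.append(Candidate(child))
        population = survivors + children
    # Score the final generation and return its best member.
    for cand in population:
        cand.fitness = score(cand.examples)
    return max(population, key=lambda c: c.fitness)


# Toy pool and scoring function (made up): prefer examples ending in "?".
toy_pool = ["what is pandas?", "hello all", "how do I join?", "thanks",
            "where is the data?", "see you", "which model?", "ok"]
toy_score = lambda examples: sum(ex.endswith("?") for ex in examples)
best = evolve(toy_pool, toy_score)
```

Selection, crossover, and mutation are the standard genetic-algorithm steps; the specific operators used in icego may differ from the ones chosen here.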

Notebooks

The Jupyter notebooks at the root of the project directory contain the computational experiments used to evaluate the two algorithms detailed in the paper.

Train-Test Split Mar 23.ipynb creates the training, validation, and test sets for the study.
Classification Endpoint File Creation March 23.ipynb generates sample augmented datasets for the Classification Endpoint evaluation, and also plots and calculates the time taken.
Classification Endpoint Sampling.ipynb evaluates the performance of the augmented Classification Endpoint using grid-search cross-validation.
Completion Test March 23.ipynb evaluates the performance of the Completion Endpoint.
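For readers unfamiliar with the first step in the list above, here is a minimal sketch of a three-way split into training, validation, and test sets. It does not reproduce the notebook: the function name three_way_split, the fractions, and the seed are placeholders, not the values used in the study.

```python
import random
from typing import List, Sequence, Tuple


def three_way_split(items: Sequence[str], val_frac: float = 0.15,
                    test_frac: float = 0.15,
                    seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle once, then cut into train, validation, and test lists.

    Hypothetical helper; the fractions are illustrative placeholders.
    """
    shuffled = list(items)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Toy data (made up): 100 placeholder questions.
questions = [f"question {i}" for i in range(100)]
train, val, test = three_way_split(questions, val_frac=0.25, test_frac=0.25)
```

Shuffling before slicing ensures the three sets do not overlap and together cover every item exactly once.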

Model Outputs

The data and model outputs from the Jupyter notebooks are stored in the saved_models directory. Figures for the paper are saved to the root of the project directory.

Note: Classification Endpoint time was measured directly from the OpenAI API webpage and saved in saved_models.

