This repository compares the performance of foundation models (Bard and ChatGPT) on different tasks and levels of prompt complexity, using visualisation and statistics.
Data:
The data was provided by DataAnnotationTech.
Categories:
1. Adversarial Dishonesty
2. Adversarial Harmfulness
3. Brainstorming
4. Classification
5. Closed QA
6. Creative Writing
7. Coding
8. Extraction
9. Mathematical Reasoning
10. Open QA
11. Poetry
12. Rewriting
13. Summarization
Likert-type rating scale:
- Bard much better
- Bard better
- Bard slightly better
- About the same
- ChatGPT slightly better
- ChatGPT better
- ChatGPT much better
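
This scale is ordinal; as a minimal sketch (the file name `ratings.csv` and the column names are assumptions, not part of the repository), it can be encoded as integers so that negative values favour Bard and positive values favour ChatGPT:

```python
import pandas as pd

# Ordinal encoding of the seven-point scale; negative = Bard preferred,
# positive = ChatGPT preferred. Labels match the scale listed above.
SCALE = {
    "Bard much better": -3,
    "Bard better": -2,
    "Bard slightly better": -1,
    "About the same": 0,
    "ChatGPT slightly better": 1,
    "ChatGPT better": 2,
    "ChatGPT much better": 3,
}

df = pd.read_csv("ratings.csv")            # hypothetical file name
df["rating_ord"] = df["rating"].map(SCALE)
```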
Tools used: pandas, plotly, statsmodels, scipy, and scikit-posthocs.
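
As one possible visualisation (a sketch only; `df` and its columns carry over from the encoding example above), plotly can show the share of each rating within each category:

```python
import plotly.express as px

# Stacked percentage bars: the share of each rating label within
# each task category (barnorm="percent" normalises each bar to 100%).
fig = px.histogram(df, x="category", color="rating",
                   barnorm="percent",
                   title="Rating distribution by category")
fig.show()
```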
Note: The dataset is imbalanced and contains a prime number of prompts (1003). Bard was never rated "Bard much better" in the Poetry category, nor "Bard better" in the Creative Writing category for simple prompts.
Results:
- Chi-square with Monte Carlo iterations: p-value = 0.0001
- Kruskal-Wallis: p-value = 6.96e-7
- Multinomial logistic regression: p-value = 0.00015
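
Below is a minimal sketch of how these three tests might be run, not the repository's actual analysis code. It assumes a DataFrame `df` with columns `category`, `rating` (the Likert label), and `rating_ord` from the encoding sketch above; Dunn's test is included as a plausible use of scikit-posthocs for post-hoc comparisons after Kruskal-Wallis, which is an assumption.

```python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import scikit_posthocs as sp

cats = df["category"].to_numpy()
ratings = df["rating"].to_numpy()

# Chi-square on the category x rating contingency table, with a
# Monte Carlo (permutation) p-value to handle sparse cells.
table = pd.crosstab(cats, ratings)
chi2_obs = stats.chi2_contingency(table)[0]
rng = np.random.default_rng(0)
n_iter = 10_000
perm = np.array([
    stats.chi2_contingency(pd.crosstab(cats, rng.permutation(ratings)))[0]
    for _ in range(n_iter)
])
p_mc = (1 + (perm >= chi2_obs).sum()) / (1 + n_iter)

# Kruskal-Wallis across categories on the ordinal ratings.
groups = [g.to_numpy() for _, g in df.groupby("category")["rating_ord"]]
kw_stat, kw_p = stats.kruskal(*groups)

# Dunn's post-hoc test (scikit-posthocs) to locate which category
# pairs differ.
dunn = sp.posthoc_dunn(df, val_col="rating_ord", group_col="category",
                       p_adjust="bonferroni")

# Multinomial logistic regression of rating on category; the
# likelihood-ratio p-value tests the overall category effect.
X = sm.add_constant(pd.get_dummies(df["category"], drop_first=True,
                                   dtype=float))
mnl = sm.MNLogit(df["rating"], X).fit(disp=False)
print(p_mc, kw_p, mnl.llr_pvalue)
```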