AI Analytics - Comparing Performance of GenAI Models

This repository was created to compare the performance of foundational models at different tasks and levels of complexity, using visualisation and statistics.

Data:

The data was provided by DataAnnotationTech.

Categories:

1. Adversarial Dishonesty
2. Adversarial Harmfulness
3. Brainstorming
4. Classification
5. Closed QA
6. Creative Writing
7. Coding
8. Extraction
9. Mathematical Reasoning
10. Open QA
11. Poetry
12. Rewriting
13. Summarization

Likert-type rating scale:

  1. Bard much better
  2. Bard better
  3. Bard slightly better
  4. About the same
  5. ChatGPT slightly better
  6. ChatGPT better
  7. ChatGPT much better

Tools used: pandas, plotly, statsmodels, scipy, and scikit-posthocs
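The ratings data itself is not reproduced in this README, but the 7-point scale above maps naturally onto ordinal codes for analysis. The following sketch shows one way to encode it with pandas; the column names and example rows are illustrative assumptions, not the actual DataAnnotationTech file format.

```python
import pandas as pd

# Map the 7-point Likert-type labels to ordinal codes (1 = Bard much
# better ... 7 = ChatGPT much better).
SCALE = {
    "Bard much better": 1,
    "Bard better": 2,
    "Bard slightly better": 3,
    "About the same": 4,
    "ChatGPT slightly better": 5,
    "ChatGPT better": 6,
    "ChatGPT much better": 7,
}

# Illustrative rows only; the real dataset has 1003 prompts across
# 13 categories.
df = pd.DataFrame({
    "category": ["Coding", "Poetry", "Open QA"],
    "rating": ["Bard better", "ChatGPT much better", "About the same"],
})
df["score"] = df["rating"].map(SCALE)  # ordinal 1-7 codes
```

The numeric `score` column is what rank-based tests such as Kruskal-Wallis operate on.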

Bard vs ChatGPT

Statistical comparison

Note: the dataset is imbalanced, with a prime number of prompts (1003). Bard was never rated "Bard much better" in the Poetry category, nor "Bard better" in the Creative Writing category for simple prompts.

  • Chi-square with Monte Carlo iterations p-value: 0.0001
  • Kruskal-Wallis p-value: 6.96E-7
  • Multinomial logistic regression p-value: 0.00015
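The repository's own analysis scripts are not shown here, but the first two tests above can be sketched with scipy on synthetic data shaped like the ratings (a 7-point score per prompt, grouped by category). Category names, sample sizes, and the Monte Carlo scheme below are assumptions for illustration; R's `chisq.test(simulate.p.value=TRUE)` uses a fixed-margins scheme, while this sketch uses a simpler parametric bootstrap under independence.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Fake 7-point ratings for three categories (the real data has 1003
# prompts across 13 categories).
ratings = {c: rng.integers(1, 8, size=60)
           for c in ("Coding", "Poetry", "Open QA")}

# Kruskal-Wallis: do rating distributions differ across categories?
h_stat, kw_p = stats.kruskal(*ratings.values())

# Category x rating contingency table (rows: categories, cols: 1-7).
table = np.array([np.bincount(r, minlength=8)[1:]
                  for r in ratings.values()])
chi2_obs = stats.chi2_contingency(table)[0]

def mc_chi2_pvalue(table, n_sim=2000, rng=rng):
    """Monte Carlo p-value for the Pearson chi-square statistic,
    simulating tables under the independence null."""
    total = table.sum()
    expected = np.outer(table.sum(1), table.sum(0)) / total
    obs = ((table - expected) ** 2 / expected).sum()
    probs = (expected / total).ravel()
    sims = rng.multinomial(total, probs,
                           size=n_sim).reshape(n_sim, *table.shape)
    sim_stats = ((sims - expected) ** 2 / expected).sum(axis=(1, 2))
    return (1 + (sim_stats >= obs).sum()) / (1 + n_sim)

mc_p = mc_chi2_pvalue(table)
```

The multinomial logistic regression could be fitted with statsmodels' `MNLogit` (rating as the outcome, dummy-coded category as the predictor), and scikit-posthocs' `posthoc_dunn` is a common follow-up to a significant Kruskal-Wallis result.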