AI Analytics - Comparing Performance of GenAI Models

This repository was created to compare the performance of foundational models at different tasks and levels of complexity, using visualisation and statistics.

Data:

The data was provided by DataAnnotationTech.

Categories:

1. Adversarial Dishonesty
2. Adversarial Harmfulness
3. Brainstorming
4. Classification
5. Closed QA
6. Creative Writing
7. Coding
8. Extraction
9. Mathematical Reasoning
10. Open QA
11. Poetry
12. Rewriting
13. Summarization

Likert-type rating scale:

  1. Bard much better
  2. Bard better
  3. Bard slightly better
  4. About the same
  5. ChatGPT slightly better
  6. ChatGPT better
  7. ChatGPT much better

Tools used: pandas, plotly, statsmodels, scipy, and scikit-posthocs
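The ratings data itself is not reproduced in this README, but the 7-point scale above maps naturally onto ordinal codes for analysis. The following sketch shows one way to encode it with pandas; the column names and example rows are illustrative assumptions, not the actual DataAnnotationTech file format.

```python
import pandas as pd

# Map the 7-point Likert-type labels to ordinal codes (1 = Bard much
# better ... 7 = ChatGPT much better).
SCALE = {
    "Bard much better": 1,
    "Bard better": 2,
    "Bard slightly better": 3,
    "About the same": 4,
    "ChatGPT slightly better": 5,
    "ChatGPT better": 6,
    "ChatGPT much better": 7,
}

# Illustrative rows only; the real dataset has 1003 prompts across
# 13 categories.
df = pd.DataFrame({
    "category": ["Coding", "Poetry", "Open QA"],
    "rating": ["Bard better", "ChatGPT much better", "About the same"],
})
df["score"] = df["rating"].map(SCALE)  # ordinal 1-7 codes
```

The numeric `score` column is what rank-based tests such as Kruskal-Wallis operate on.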

Bard vs ChatGPT

Statistical comparison

Note: the dataset is imbalanced, with a prime number of prompts (1003). Bard was never rated "Bard much better" in the Poetry category, nor "Bard better" in the Creative Writing category for simple prompts.

  • Chi-square with Monte Carlo iterations p-value: 0.0001
  • Kruskal-Wallis p-value: 6.96E-7
  • Multinomial logistic regression p-value: 0.00015
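The repository's own analysis scripts are not shown here, but the first two tests above can be sketched with scipy on synthetic data shaped like the ratings (a 7-point score per prompt, grouped by category). Category names, sample sizes, and the Monte Carlo scheme below are assumptions for illustration; R's `chisq.test(simulate.p.value=TRUE)` uses a fixed-margins scheme, while this sketch uses a simpler parametric bootstrap under independence.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Fake 7-point ratings for three categories (the real data has 1003
# prompts across 13 categories).
ratings = {c: rng.integers(1, 8, size=60)
           for c in ("Coding", "Poetry", "Open QA")}

# Kruskal-Wallis: do rating distributions differ across categories?
h_stat, kw_p = stats.kruskal(*ratings.values())

# Category x rating contingency table (rows: categories, cols: 1-7).
table = np.array([np.bincount(r, minlength=8)[1:]
                  for r in ratings.values()])
chi2_obs = stats.chi2_contingency(table)[0]

def mc_chi2_pvalue(table, n_sim=2000, rng=rng):
    """Monte Carlo p-value for the Pearson chi-square statistic,
    simulating tables under the independence null."""
    total = table.sum()
    expected = np.outer(table.sum(1), table.sum(0)) / total
    obs = ((table - expected) ** 2 / expected).sum()
    probs = (expected / total).ravel()
    sims = rng.multinomial(total, probs,
                           size=n_sim).reshape(n_sim, *table.shape)
    sim_stats = ((sims - expected) ** 2 / expected).sum(axis=(1, 2))
    return (1 + (sim_stats >= obs).sum()) / (1 + n_sim)

mc_p = mc_chi2_pvalue(table)
```

The multinomial logistic regression could be fitted with statsmodels' `MNLogit` (rating as the outcome, dummy-coded category as the predictor), and scikit-posthocs' `posthoc_dunn` is a common follow-up to a significant Kruskal-Wallis result.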