
AI Analytics - Comparing Performance of GenAI Models

This repository was created to compare the performance of foundational models at different tasks and levels of complexity, using visualisation and statistics.

Data:

The data was provided by DataAnnotationTech.

Categories:

1. Adversarial Dishonesty
2. Adversarial Harmfulness
3. Brainstorming
4. Classification
5. Closed QA
6. Creative Writing
7. Coding
8. Extraction
9. Mathematical Reasoning
10. Open QA
11. Poetry
12. Rewriting
13. Summarization

Likert-type rating scale:

  1. Bard much better
  2. Bard better
  3. Bard slightly better
  4. About the same
  5. ChatGPT slightly better
  6. ChatGPT better
  7. ChatGPT much better
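
A minimal sketch of how these labels could be encoded as an ordered categorical for analysis; the file name (`ratings.csv`) and column names (`rating`, `category`) are hypothetical, not taken from this repository:

```python
import pandas as pd

# Likert labels in order, so code 1 = "Bard much better" ... 7 = "ChatGPT much better"
LIKERT = [
    "Bard much better", "Bard better", "Bard slightly better",
    "About the same",
    "ChatGPT slightly better", "ChatGPT better", "ChatGPT much better",
]

# Assumed input: one row per prompt, with 'category' and 'rating' columns
df = pd.read_csv("ratings.csv")
df["rating"] = pd.Categorical(df["rating"], categories=LIKERT, ordered=True)
df["rating_code"] = df["rating"].cat.codes + 1  # ordinal 1-7
```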

Tools used: pandas, plotly, statsmodels, scipy and scikit-posthocs
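
For instance, the rating distribution per category can be visualised as a grouped plotly bar chart. This is a sketch under the same assumed file and column names as above, not the repository's actual plotting code:

```python
import pandas as pd
import plotly.express as px

df = pd.read_csv("ratings.csv")  # assumed columns: 'category', 'rating'

# Count how often each rating was given within each category
counts = df.groupby(["category", "rating"]).size().reset_index(name="count")

fig = px.bar(
    counts, x="category", y="count", color="rating",
    barmode="group", title="Bard vs ChatGPT ratings by category",
)
fig.show()
```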

Bard vs ChatGPT

Statistical comparison

Note: The dataset is imbalanced (1003 prompts in total). Bard was never rated "Bard much better" in the Poetry category, and never rated "Bard better" in the Creative Writing category for simple prompts.

  • Chi-square with Monte Carlo iterations, p-value: 0.0001
  • Kruskal-Wallis, p-value: 6.96e-07
  • Multinomial logistic regression, p-value: 0.00015
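
A minimal sketch of how these three tests could be run with the tools listed above. The file name and column names (`ratings.csv`, `category`, `rating_code`) are assumptions, and the Monte Carlo step is implemented here as a label permutation; the repository's actual analysis may differ:

```python
import numpy as np
import pandas as pd
import scikit_posthocs as sp
import statsmodels.api as sm
from scipy import stats

df = pd.read_csv("ratings.csv")  # assumed columns: 'category', 'rating_code' (1-7)
rng = np.random.default_rng(0)

# --- Chi-square with a Monte Carlo (permutation) p-value ---
# Permuting the ratings against the categories keeps both margins fixed,
# which helps when sparse cells make the asymptotic chi-square
# approximation unreliable.
table = pd.crosstab(df["category"], df["rating_code"])
chi2_obs, _, _, _ = stats.chi2_contingency(table)
ratings = df["rating_code"].to_numpy()
n_iter = 10_000
exceed = 0
for _ in range(n_iter):
    perm = pd.crosstab(df["category"], rng.permutation(ratings))
    chi2_perm, _, _, _ = stats.chi2_contingency(perm)
    exceed += chi2_perm >= chi2_obs
p_chi2_mc = (exceed + 1) / (n_iter + 1)

# --- Kruskal-Wallis across categories, with Dunn's post hoc test ---
groups = [g.to_numpy() for _, g in df.groupby("category")["rating_code"]]
kw_stat, p_kw = stats.kruskal(*groups)
dunn = sp.posthoc_dunn(df, val_col="rating_code", group_col="category",
                       p_adjust="bonferroni")

# --- Multinomial logistic regression: rating ~ category ---
# Empty rating-by-category cells (see the note above) can trigger
# convergence warnings here.
X = sm.add_constant(pd.get_dummies(df["category"], drop_first=True, dtype=float))
mnlogit = sm.MNLogit(df["rating_code"], X).fit(disp=False)

print(f"Chi-square MC p-value: {p_chi2_mc:.4f}")
print(f"Kruskal-Wallis p-value: {p_kw:.2e}")
print(f"MNLogit LR-test p-value: {mnlogit.llr_pvalue:.5f}")
```

Dunn's test with a Bonferroni correction is shown as one common post hoc choice after Kruskal-Wallis; scikit-posthocs offers several alternatives.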