
TLM o1-preview Benchmark

This folder contains the dataset and code to reproduce the TLM-o1 benchmark published in our blog post.

Because OpenAI's o1-preview API is still slow and costly to run across larger datasets, for SVAMP and TriviaQA we specifically selected challenging examples that OpenAI's GPT-4o model got wrong. For the TriviaQA benchmark, we used 114 examples from the validation set that GPT-4o answered incorrectly and for which we manually confirmed that the listed ground-truth answer is actually correct. For the SVAMP benchmark, we used 49 examples selected and verified the same way. For the PII Detection benchmark, we focused specifically on identifying first names present in the text, using a dataset of 98 examples.
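The selection step described above can be sketched as a simple filter: keep only the examples the baseline model (GPT-4o) answered incorrectly, producing candidates for the manual ground-truth review. This is an illustrative sketch, not the code used for the benchmark; the field names (`question`, `ground_truth`, `gpt4o_answer`) are assumptions.

```python
def select_hard_examples(examples):
    """Keep only examples the baseline model got wrong.

    Each example is a dict with hypothetical keys:
    "question", "ground_truth", and "gpt4o_answer".
    The survivors are candidates for manual verification
    that the listed ground truth is actually correct.
    """
    return [
        ex for ex in examples
        if ex["gpt4o_answer"].strip().lower() != ex["ground_truth"].strip().lower()
    ]

examples = [
    {"question": "2 + 2 = ?", "ground_truth": "4", "gpt4o_answer": "4"},
    {"question": "Capital of Australia?", "ground_truth": "Canberra", "gpt4o_answer": "Sydney"},
]

hard = select_hard_examples(examples)
print(hard)  # only the example the baseline answered wrong remains
```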

API keys: you will need your own API keys to run these notebooks (an OpenAI API key for the o1-preview notebook, and a TLM API key for the TLM notebook).

To reproduce the benchmarks:

  • Use the openai_o1_preview_benchmark_reproduce.ipynb notebook to reproduce the OpenAI o1-preview benchmark.
  • Use the tlm_o1_preview_benchmark_reproduce.ipynb notebook to reproduce the TLM o1-preview benchmark.