This folder contains the datasets and code to reproduce the TLM-o1 benchmarks we published in our blog post.
For SVAMP and TriviaQA, we specifically selected challenging examples that OpenAI's GPT-4o model answered incorrectly, since OpenAI's o1-preview API is still too slow and costly to benchmark across larger datasets:

- TriviaQA: 114 examples from the validation set that GPT-4o answered incorrectly, where we manually confirmed the answer listed as ground truth is actually correct.
- SVAMP: 49 examples that GPT-4o answered incorrectly, where we manually confirmed the answer listed as ground truth is actually correct.
- PII Detection: 98 examples, specifically focused on identifying first names present in the text.
API keys:

- A `CLEANLAB_API_KEY` is required to run this project. Get a Cleanlab API key at https://app.cleanlab.ai/tlm
- An `OPENAI_API_KEY` is also required to run this project. Get an OpenAI key at https://platform.openai.com/api-keys
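
If you run the notebooks locally, one simple option is to set both keys as environment variables before executing any cells. A minimal sketch (assuming the notebooks read the keys from the environment, which is a convention on our part rather than something stated above):

```python
import os

# Placeholder values; substitute your real keys.
os.environ["CLEANLAB_API_KEY"] = "<your-cleanlab-api-key>"
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
```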
To reproduce the benchmarks:

- Use the `openai_o1_preview_benchmark_reproduce.ipynb` notebook to reproduce the OpenAI o1-preview benchmark.
- Use the `tlm_o1_preview_benchmark_reproduce.ipynb` notebook to reproduce the TLM o1-preview benchmark.
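
For context, the core TLM call in the second notebook looks roughly like the sketch below. It assumes the `cleanlab-studio` Python client, and the `options={"model": "o1-preview"}` argument is our assumption about how the notebook selects the underlying model, not something confirmed above:

```python
import os
from cleanlab_studio import Studio

# Authenticate with the Cleanlab API key from the environment.
studio = Studio(os.environ["CLEANLAB_API_KEY"])

# Assumption: TLM is configured to use o1-preview as the base model.
tlm = studio.TLM(options={"model": "o1-preview"})

# TLM returns both the model response and a trustworthiness score in [0, 1].
result = tlm.prompt("What is 17 * 24?")
print(result["response"])
print(result["trustworthiness_score"])
```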