DiscoveryBench (Paper) contains 264 tasks collected across 6 diverse domains, such as biology, economics, and sociology. It incorporates discovery workflows from published papers to approximate the real-world challenges faced by researchers.
Please follow the instructions mentioned here to set up the OpenHands development environment and LLMs locally.
Run the following bash script to start the DiscoveryBench evaluation:
./evaluation/discoverybench/scripts/run_infer.sh [YOUR MODEL CONFIG]
Replace [YOUR MODEL CONFIG] with any model that you have set up in config.toml.
When the run_infer.sh script is started, it automatically pulls the latest DiscoveryBench instances and sets up the agent environment. The OpenHands agent is then invoked to process each task within this environment, producing a hypothesis, which is evaluated against the “gold” hypothesis provided by DiscoveryBench. The evaluation result, along with the agent chat history, is logged to output.jsonl under evaluation_outputs.
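As a quick sanity check after a run, the number of logged records can be counted. The sketch below assumes only that output.jsonl follows the usual JSON Lines layout (one JSON object per non-empty line); `count_results` is a hypothetical helper name, and the path passed to it is a placeholder for wherever the file lands under evaluation_outputs.

```shell
# Sketch: count the records in a JSON Lines file.
# Assumption: each non-empty line holds one logged record.
count_results() {
    # grep -c . prints the number of lines containing at least one character
    grep -c . "$1"
}

# Usage (placeholder path, adjust to your run):
#   count_results path/to/output.jsonl
```

Comparing this count with the number of samples you asked for is a cheap way to spot a run that stopped early.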
./evaluation/discoverybench/scripts/run_infer.sh [MODEL_CONFIG] [GIT_COMMIT] [AGENT] [EVAL_LIMIT] [NUM_WORKERS]
MODEL_CONFIG: Name of the model you want to evaluate with.
GIT_COMMIT: The git commit hash or release tag for OpenHands, e.g., HEAD or a specific tag like 0.6.2.
AGENT: Use CodeActAgent; it is currently the only supported agent.
EVAL_LIMIT: Number of samples to evaluate.
NUM_WORKERS: Number of workers to parallelize the evaluation process.
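To make the positional order concrete, here is a minimal sketch of a fully specified invocation. Every value is an illustrative assumption: `llm.my_model` is a hypothetical config name standing in for one you defined in config.toml, and the sample and worker counts are arbitrary. The snippet only prints the command so it can be reviewed before running.

```shell
# Sketch: fill in the five positional arguments for run_infer.sh.
# All values are examples, not defaults taken from the script.
MODEL_CONFIG="llm.my_model"   # hypothetical name; use a config from your config.toml
GIT_COMMIT="HEAD"             # or a release tag such as 0.6.2
AGENT="CodeActAgent"          # the only agent supported right now
EVAL_LIMIT=10                 # number of samples to evaluate
NUM_WORKERS=2                 # number of parallel workers

# Assemble the command in the order the script expects.
CMD="./evaluation/discoverybench/scripts/run_infer.sh $MODEL_CONFIG $GIT_COMMIT $AGENT $EVAL_LIMIT $NUM_WORKERS"

# Print it for review; run it from the OpenHands repository root once it looks right.
echo "$CMD"
```

Starting with a small EVAL_LIMIT is a reasonable way to check the setup end to end before launching a full run.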