DiscoveryBench with OpenHands

DiscoveryBench (Paper) contains 264 tasks collected across 6 diverse domains, such as biology, economics, and sociology. It incorporates discovery workflows from published papers to approximate the real-world challenges faced by researchers.

DiscoveryBench Background

Setup Environment and LLM Configuration

  1. Please follow the instructions here to set up the OpenHands development environment and configure your LLMs locally.

  2. Execute the bash script to start the DiscoveryBench evaluation:

./evaluation/discoverybench/scripts/run_infer.sh [YOUR MODEL CONFIG]

Replace [YOUR MODEL CONFIG] with the model config that you have set up in config.toml.
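For example, if your config.toml defines an LLM config group named llm.eval_gpt4o (an illustrative name, substitute whichever group you created), the call would be:

./evaluation/discoverybench/scripts/run_infer.sh llm.eval_gpt4o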

Run Inference on DiscoveryBench Instances

When the run_infer.sh script is started, it automatically pulls the latest DiscoveryBench instances and sets up the agent environment. The OpenHands agent is invoked to process the task within this environment, producing a hypothesis. We then evaluate it against the "gold" hypothesis provided by DiscoveryBench. The evaluation result, along with the agent chat history, is logged to output.jsonl under evaluation_outputs.
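Once a run completes, you can locate and inspect the results. The exact directory layout under evaluation_outputs depends on your model, agent, and commit, so the paths below are placeholders:

find . -path "*evaluation_outputs*" -name output.jsonl
head -n 1 [PATH TO RUN OUTPUT]/output.jsonl | python -m json.tool

The full argument list for run_infer.sh is: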

./evaluation/discoverybench/scripts/run_infer.sh [MODEL_CONFIG] [GIT_COMMIT] [AGENT] [EVAL_LIMIT] [NUM_WORKERS]
  • MODEL_CONFIG: Name of the model config you want to evaluate with
  • GIT_COMMIT: The git commit hash or release tag of OpenHands to evaluate, e.g., HEAD or a specific tag like 0.6.2.
  • AGENT: Use CodeActAgent; it is currently the only supported agent.
  • EVAL_LIMIT: Number of samples to evaluate.
  • NUM_WORKERS: Number of workers to parallelize the evaluation process.
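For example, a small smoke-test run over 10 instances with a single worker might look like this (the llm.eval_gpt4o config name is illustrative):

./evaluation/discoverybench/scripts/run_infer.sh llm.eval_gpt4o HEAD CodeActAgent 10 1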