
Multi-Programming Language Evaluation of Large Language Models of Code (MultiPL-E)

MultiPL-E is a system for translating unit-test-driven neural code generation benchmarks into new languages. We have used MultiPL-E to translate two popular Python benchmarks (HumanEval and MBPP) into 18 other programming languages.


Versions

  • Version 3.0

    • The changelog is now maintained on the dataset page: https://huggingface.co/datasets/nuprl/MultiPL-E
    • The dataset was versioned at 3.0, and we are bumping the software version to stay in sync.
    • Several new PLs have been published in the dataset; however, Dafny, Coq, Lean, Luau, and MATLAB are not included at this time.
  • Version 0.5.0: Instruction-following support and new languages

    • New languages: Luau, Elixir, Lean, Coq, Dafny
    • Support for instruction-following prompts
    • vLLM support for faster evaluation
  • Version 0.4.0: QoL improvements and new languages

    • New languages: OCaml, MATLAB
    • Using .jsonl instead of .json for prompts
    • Several bugfixes to prompts
  • Version 0.3.0: used to evaluate StarCoder

    • This version corrects several bugs in prompts and test cases that resulted in lower pass@k rates for some of the statically typed languages. The most significant difference is that the pass@k for Java increases by about 2% on HumanEval.
  • Version 0.2.0: used to evaluate SantaCoder
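
As of version 0.4.0, prompts are stored as .jsonl, i.e. one JSON object per line rather than a single JSON array. A minimal sketch of reading such a file; the field names shown are illustrative assumptions, not the project's actual schema:

```python
import json

def read_prompts(path):
    """Parse a .jsonl file: one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Write an illustrative record shaped like a translated prompt.
# The keys ("name", "language", "prompt") are assumptions for this sketch.
with open("prompts.jsonl", "w") as f:
    f.write('{"name": "HumanEval_0_lua", "language": "lua", "prompt": "-- ..."}\n')

records = read_prompts("prompts.jsonl")
print(records[0]["language"])  # lua
```

Compared to a monolithic .json array, the line-per-record format can be streamed, appended to, and diffed without reparsing the whole file.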
