All samples are generated from scratch and uniformly post-processed by our sanitizer script. Syntax checkers ensure that trivial syntax errors (e.g., broken Python indentation) do not contribute to failing tests.
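The syntax-checking step can be sketched as follows. This is a minimal illustration using Python's standard `ast` module, not the actual EvalPlus sanitizer; the function name `is_syntactically_valid` is hypothetical.

```python
import ast

def is_syntactically_valid(code: str) -> bool:
    """Return True if the code parses as valid Python (no syntax errors).

    Illustrative only -- the real sanitizer does more than a parse check.
    """
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True

# A sample with a broken indent is flagged before it ever reaches the tests:
print(is_syntactically_valid("def f():\nreturn 1"))      # missing indentation
print(is_syntactically_valid("def f():\n    return 1"))  # well-formed
```

Filtering like this separates genuine logic failures from formatting accidents, so a model is not penalized twice for the same trivial slip.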
By default, models are ranked by pass@1 computed with greedy decoding. Model setup details can be found here.
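For intuition, pass@1 under greedy decoding reduces to the plain pass rate: one sample per task, scored 1 if it passes and 0 otherwise. A sketch using the standard unbiased pass@k estimator (Chen et al., 2021); the per-task outcomes below are hypothetical, and this is not the exact EvalPlus scoring code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples drawn per task,
    c = samples that pass, k = evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With greedy decoding there is exactly one sample per task (n = k = 1),
# so pass@1 is simply the fraction of tasks whose sample passes:
results = [True, False, True, True]  # hypothetical per-task outcomes
score = sum(pass_at_k(1, int(ok), 1) for ok in results) / len(results)
print(score)  # 0.75
```

The combinatorial form matters only when multiple samples are drawn per task (e.g., pass@10 from 200 samples); with greedy decoding it degenerates to a simple average.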
Models labelled with 🗒️ are evaluated in an instruction/chat setting, while the others perform direct code generation given the prompt.
For MBPP/MBPP+, we use only the subset of well-formed problems (399 tasks) from MBPP-sanitized (427 tasks).
It is the model providers' responsibility to avoid data contamination as much as possible; we cannot guarantee whether the evaluated models are contaminated.
🤗 More Leaderboards
In addition to the EvalPlus leaderboards, we recommend building a comprehensive picture of LLM coding ability from a diverse set of benchmarks and leaderboards, such as: