EvalPlus Leaderboard
EvalPlus evaluates AI Coders with rigorous tests.
Notes
- All samples are generated from scratch and are uniformly post-processed by our sanitizer script. Syntactical checkers are used to make sure that trivial syntactical errors (e.g., broken Python indentation) do not contribute to failing tests (see the syntax-check sketch after this list).
- By default, models are ranked according to pass@1 using greedy decoding (see the pass@k sketch after this list). Model setup details can be found here.
- Models labelled with the instruction/chat marker are evaluated using an instruction/chat setting, while others perform direct code generation given the prompt.
- For MBPP/MBPP+, we use only the well-formed subset (399 tasks) of MBPP-sanitized (427 tasks).
- It is the model providers' responsibility to avoid data contamination as much as possible. In other words, we cannot guarantee that the evaluated models are free of contamination.
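As an illustration of the syntax check mentioned in the first note, the sketch below (not EvalPlus's actual sanitizer script; the `is_syntactically_valid` helper is hypothetical) flags samples that fail to parse, so trivial errors such as broken indentation are filtered out before test execution rather than being counted as semantic failures:

```python
# Illustrative sketch only; EvalPlus ships its own sanitizer, and this is not it.
import ast

def is_syntactically_valid(sample: str) -> bool:
    """Return True if a generated Python sample parses cleanly."""
    try:
        ast.parse(sample)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False

# A sample with a trivial indentation error is caught here,
# so it is flagged by the checker instead of failing the tests at runtime.
broken = "def add(a, b):\nreturn a + b"
print(is_syntactically_valid(broken))  # False
```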
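For the ranking metric, the sketch below shows the standard unbiased pass@k estimator (Chen et al., 2021). With greedy decoding there is a single sample per task, so pass@1 reduces to the fraction of tasks whose greedy sample passes every test. The function name and the toy data are illustrative, not part of EvalPlus:

```python
# Minimal sketch of the pass@k estimator, assuming n samples per task
# of which c pass; with greedy decoding, n = k = 1.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-task pass/fail indicators under greedy decoding (n = 1):
results = [1, 0, 1, 1]
print(sum(pass_at_k(1, c, 1) for c in results) / len(results))  # 0.75
```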
More Leaderboards
In addition to the EvalPlus leaderboards, we recommend assessing LLM coding ability comprehensively through a diverse set of benchmarks and leaderboards, such as: