All samples are generated from scratch and uniformly post-processed by our sanitizer script. Syntax checkers ensure that trivial syntax errors (e.g., broken Python indentation) do not contribute to failing tests.
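The syntax-checking step can be sketched as follows. This is a minimal illustration using Python's standard `ast` module, not the actual EvalPlus sanitizer; the function name `is_syntactically_valid` is hypothetical.

```python
import ast

def is_syntactically_valid(code: str) -> bool:
    """Return True if the code parses as valid Python (no syntax errors).

    Illustrative only -- the real sanitizer does more than a parse check.
    """
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True

# A sample with a broken indent is flagged before it ever reaches the tests:
print(is_syntactically_valid("def f():\nreturn 1"))      # missing indentation
print(is_syntactically_valid("def f():\n    return 1"))  # well-formed
```

Filtering like this separates genuine logic failures from formatting accidents, so a model is not penalized twice for the same trivial slip.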
By default, models are ranked by pass@1 computed with greedy decoding. Model setup details can be found here.
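For intuition, pass@1 under greedy decoding reduces to the plain pass rate: one sample per task, scored 1 if it passes and 0 otherwise. A sketch using the standard unbiased pass@k estimator (Chen et al., 2021); the per-task outcomes below are hypothetical, and this is not the exact EvalPlus scoring code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples drawn per task,
    c = samples that pass, k = evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With greedy decoding there is exactly one sample per task (n = k = 1),
# so pass@1 is simply the fraction of tasks whose sample passes:
results = [True, False, True, True]  # hypothetical per-task outcomes
score = sum(pass_at_k(1, int(ok), 1) for ok in results) / len(results)
print(score)  # 0.75
```

The combinatorial form matters only when multiple samples are drawn per task (e.g., pass@10 from 200 samples); with greedy decoding it degenerates to a simple average.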
Models labelled with 🗒️ are evaluated in an instruction/chat setting, while the others perform direct code generation given the prompt.
For MBPP/MBPP+, we use only the subset of well-formed problems (399 tasks) from MBPP-sanitized (427 tasks).
It is the model providers' responsibility to avoid data contamination as much as possible; we cannot guarantee whether the evaluated models are contaminated.
🤗 More Leaderboards
In addition to the EvalPlus leaderboards, we recommend building a comprehensive picture of LLM coding ability from a diverse set of benchmarks and leaderboards, such as: