EvalPlus Leaderboard

📢 News: Beyond correctness, how efficient is the generated code? Check out 🚀EvalPerf!

Evaluated using HumanEval+ version 0.1.10; MBPP+ version 0.2.0.
Models are ranked according to pass@1 using greedy decoding. Setup details can be found here.
✨ marks models evaluated using a chat setting, while others perform direct code completion.
Both MBPP and MBPP+ as referred to in our leaderboard use a subset (399 tasks) of hand-verified problems from MBPP-sanitized (427 tasks), to make sure each programming task is well-formed (e.g., test_list is not wrong).
Model providers are responsible for avoiding data contamination. Models trained on closed data can be affected by contamination.
💚 means open weights and open data. 💙 means open weights and open SFT data, but the base model is not data-open. What does this imply? 💚💙 models open-source the data such that one can concretely reason about contamination.
"Size" here is the amount of model weights activated during inference.
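
To illustrate the ranking rule in the notes above: with greedy decoding there is one deterministic sample per task, so pass@1 reduces to the percentage of tasks whose single sample passes every test. The following is a minimal sketch under that assumption, not the EvalPlus implementation; the function names, model names, task IDs, and outcomes are all hypothetical.

```python
from typing import Dict, List, Tuple


def pass_at_1_greedy(results: Dict[str, bool]) -> float:
    """Percentage of tasks whose single greedy sample passed all tests.

    `results` maps a task ID to True if the one greedy sample for that
    task passed every test case, and False otherwise.
    """
    if not results:
        raise ValueError("no task results given")
    return 100.0 * sum(results.values()) / len(results)


def rank_models(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort (model, score) pairs from highest to lowest pass@1."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical per-task outcomes for two made-up models.
model_a = {"task/0": True, "task/1": True, "task/2": False, "task/3": True}
model_b = {"task/0": True, "task/1": False, "task/2": False, "task/3": True}

scores = {
    "model-a": pass_at_1_greedy(model_a),
    "model-b": pass_at_1_greedy(model_b),
}
ranking = rank_models(scores)
print(ranking)
```

Scoring the same model once against the base tests and once against the extended "+" tests would give two such numbers per benchmark; the sketch computes only one of them.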

🏆 EvalPlus Leaderboard 🏆