πŸ† EvalPlus Leaderboard πŸ†

EvalPlus evaluates AI Coders with rigorous tests.

πŸ“’ News: Beyond correctness, how efficient is their code? Check out πŸš€EvalPerf!


πŸ“ Notes

  1. Evaluated using HumanEval+ version 0.1.10; MBPP+ version 0.2.0.
  2. Models are ranked according to pass@1 using greedy decoding (see the sketch after these notes). Setup details can be found here.
  3. ✨ marks models evaluated using a chat setting, while others perform direct code completion.
  4. Both MBPP and MBPP+ referred to in our leaderboard use a subset (399 tasks) of hand-verified problems from MBPP-sanitized (427 tasks), to ensure each programming task is well-formed (e.g., the test_list is not wrong).
  5. Model providers are responsible for avoiding data contamination. Models trained on closed data may be affected by contamination.
  6. πŸ’š means open weights and open data. πŸ’™ means open weights and open SFT data, but the base model is not data-open. What does this imply? πŸ’šπŸ’™ models open-source their data, so one can concretely reason about contamination.
  7. "Size" here is the amount of activated model weight during inference.

πŸ€— More Leaderboards

In addition to the EvalPlus leaderboards, we recommend assessing LLM coding ability comprehensively through a diverse set of benchmarks and leaderboards, such as: