
Benchmarks @ EvalPlus

The EvalPlus team aims to build high-quality benchmarks for evaluating LLMs on code. Below are the benchmarks we have been building so far:

HumanEval+ & MBPP+

HumanEval and MBPP originally shipped with limited tests. EvalPlus built HumanEval+ and MBPP+ by extending their test suites by 80x and 35x respectively, enabling more rigorous evaluation.

Go to EvalPlus Leaderboard
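The value of the extra tests is easy to see with a small sketch. The example below is hypothetical (it is not the actual HumanEval+/MBPP+ test generator): a buggy solution to a HumanEval-style task passes a limited base suite, and a single added edge case catches it.

```python
# Task (HumanEval-style): decide whether a running account balance
# ever drops below zero during a sequence of deposits/withdrawals.

def below_zero_buggy(operations):
    # Bug: only checks the FINAL balance, ignoring intermediate dips.
    balance = 0
    for op in operations:
        balance += op
    return balance < 0

def below_zero_correct(operations):
    # Checks the balance after every operation.
    balance = 0
    for op in operations:
        balance += op
        if balance < 0:
            return True
    return False

# A small base test suite: the buggy solution slips through.
base_tests = [([1, 2, 3], False), ([1, 2, -7], True)]
assert all(below_zero_buggy(ops) == want for ops, want in base_tests)

# One extra, EvalPlus-style edge case exposes the bug: the balance
# dips to -1 mid-sequence but ends positive.
ops, want = [1, -2, 3], True
assert below_zero_correct(ops) == want
assert below_zero_buggy(ops) != want  # buggy solution caught
```

Scaling this idea up (many automatically generated edge cases per problem) is what makes the "+" suites stricter than the originals.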

RepoQA: Long-Context Code Understanding

Repository understanding is crucial for intelligent code agents. With RepoQA, we are designing evaluators for long-context code understanding.

Learn about RepoQA