The EvalPlus team aims to build high-quality benchmarks for evaluating LLMs for code. Below are the benchmarks we have been building so far:
HumanEval and MBPP originally came with limited tests. EvalPlus created HumanEval+ and MBPP+ by extending their test suites 80x and 35x, respectively, for rigorous evaluation.
Go to EvalPlus Leaderboard

Repository understanding is crucial for intelligent code agents. With RepoQA, we are designing evaluators for long-context code understanding.
Learn about RepoQA