The EvalPlus team aims to build high-quality, precise evaluators for understanding LLM performance on code-related tasks:
HumanEval and MBPP originally came with limited tests. EvalPlus built HumanEval+ and MBPP+ by extending their test suites by 80x and 35x, respectively, for rigorous evaluation.

Go to EvalPlus Leaderboard
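As an illustration of why the extended suites matter, here is a minimal sketch (not the evalplus package API; the function and test cases are hypothetical): a subtly buggy solution passes a small base test set but is caught once the inputs are broadened, mirroring the HumanEval+ / MBPP+ idea.

```python
# Illustrative sketch only: extended tests expose bugs that a tiny base suite misses.

def candidate_abs_diff(a: int, b: int) -> int:
    # Buggy candidate: correct whenever a >= b, wrong otherwise.
    return a - b

base_tests = [((5, 3), 2), ((7, 7), 0)]                  # small, easy inputs
plus_tests = base_tests + [((3, 5), 2), ((-4, 10), 14)]  # extended coverage

def passes(tests) -> bool:
    return all(candidate_abs_diff(*args) == expected for args, expected in tests)

print("base suite:", "pass" if passes(base_tests) else "fail")  # pass
print("plus suite:", "pass" if passes(plus_tests) else "fail")  # fail
```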
Based on Differential Performance Evaluation, proposed in our COLM'24 paper, we rigorously evaluate the efficiency of LLM-generated code using performance-exercising coding tasks and test inputs.

EvalPerf Leaderboard
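A minimal sketch of the underlying idea (not the EvalPerf harness; the task, solutions, and input size are assumed for illustration): two functionally correct solutions can differ sharply once the test input is performance-exercising.

```python
# Illustrative sketch only: performance-exercising inputs separate correct-but-slow
# solutions from efficient ones.
import time

def sum_of_pairs_quadratic(xs):
    # O(n^2): correct but slow.
    return sum(a + b for a in xs for b in xs)

def sum_of_pairs_linear(xs):
    # O(n): same result, derived analytically (each element appears 2n times in total).
    return 2 * len(xs) * sum(xs)

xs = list(range(3_000))  # performance-exercising input (size assumed for the demo)

for fn in (sum_of_pairs_quadratic, sum_of_pairs_linear):
    start = time.perf_counter()
    result = fn(xs)
    elapsed = time.perf_counter() - start
    print(f"{fn.__name__}: result={result}, {elapsed:.4f}s")
```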
Repository understanding is crucial for intelligent code agents. With RepoQA, we are designing evaluators of long-context code understanding.
Learn about RepoQA