The EvalPlus team aims to build high-quality benchmarks for evaluating LLMs for code. Below are the benchmarks we have been building so far:
HumanEval and MBPP originally came with limited tests. EvalPlus created HumanEval+ and MBPP+ by extending their test suites 80x and 35x, respectively, for rigorous evaluation.
Go to EvalPlus Leaderboard

Repository understanding is crucial for intelligent code agents. With RepoQA, we are designing evaluators for long-context code understanding.
Learn about RepoQA