Evaluating LLMs for Efficient Code Generation


🚀 LLM-oriented code efficiency evaluation requires:

  • Performance-exercising tasks & inputs -- "all complexities are equal when N is small" (see the toy timing sketch right after this list)
  • Meaningful compound metric -- average speedup does not fit multi-task evaluation
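
As a toy illustration of the first point (every name and size below is made up, not an EvalPerf task), the sketch times a quadratic and a linear duplicate check: at N=10 the two are indistinguishable, while a larger N separates them clearly.

import timeit

def has_dup_quadratic(xs):   # O(N^2): scan the remaining suffix for each element
    return any(x in xs[i + 1:] for i, x in enumerate(xs))

def has_dup_linear(xs):      # O(N): single pass with a hash set
    seen = set()
    for x in xs:
        if x in seen:
            return True
        seen.add(x)
    return False

for n in (10, 2_000):        # tiny input vs. performance-exercising input
    xs = list(range(n))
    t_quad = timeit.timeit(lambda: has_dup_quadratic(xs), number=3)
    t_lin = timeit.timeit(lambda: has_dup_linear(xs), number=3)
    print(f"N={n}: quadratic {t_quad:.6f}s vs. linear {t_lin:.6f}s")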

🛍️ Based on our methodology, the EvalPerf dataset (current version 20240328) includes:

  • 118 performance-exercising tasks
  • Each task is equipped with a computationally challenging test input generated by the SaS generator (a hypothetical generator sketch follows this list)
  • Differential Performance Score (DPS): "DPS=80" means "submissions can outperform or match 80% of LLM solutions"
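
Below is a hypothetical sketch of what a SaS-style, scale-controlled input generator might look like for a toy task; the actual EvalPerf generators are task-specific and synthesized by LLMs, and every name and number here is illustrative only.

import random

def sas_style_generator(scale: int) -> list[int]:
    # Hypothetical scale-controlled generator: a larger `scale` yields larger,
    # still-valid inputs, so the harness can dial inputs up until the task
    # becomes performance-exercising.
    random.seed(0)  # deterministic for reproducible measurements
    return [random.randint(0, scale) for _ in range(scale)]

# e.g., keep the largest scale whose reference run still fits the time budget
challenging_input = sas_style_generator(scale=100_000)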

🦾 The reliability of EvalPerf comes from:

  • Correctness ablation: Pairwise comparison of LLMs' code efficiency over commonly passed tasks
  • Anti-flakiness: (1) long computation -> low runtime variation (Paper Fig. 6); (2) the number of CPU instructions as the primitive metric (see the illustrative perf command after this list); & (3) DPS compares the given solution with reference solutions on the same test bed. -- Together these lead to low cross-platform variation (Paper Tab. 2)
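
To make the second point concrete, the command below is an illustration only (solution.py is a placeholder, not part of EvalPerf): Linux perf counts retired user-space instructions, a measure that fluctuates far less across runs and machines than wall-clock time.

perf stat -e instructions:u -- python3 solution.py  # instruction count as the primitive measurement
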
Check out our COLM'24 poster and the latest experimental configurations for more details!
          
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Allow unprivileged access to perf hardware counters
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
          
Recommended comparison format:

Win-rate Leaderboard

📊 Ranking metrics: WR (Win-Rate; %) based on task- and model-wise competition (i.e., pairwise DPS).
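
As a rough, hypothetical sketch of how pairwise DPS can be turned into a win rate (model and task names are made up, and the exact tie-breaking and aggregation follow the paper rather than this snippet):

def win_rate(dps_a: dict, dps_b: dict) -> float:
    # Compare two models on the tasks they both pass: a higher DPS counts as
    # a win, a tie counts as half a win.
    common = dps_a.keys() & dps_b.keys()
    wins = sum(1.0 if dps_a[t] > dps_b[t] else 0.5 if dps_a[t] == dps_b[t] else 0.0
               for t in common)
    return 100.0 * wins / len(common)

model_a = {"task_1": 92.0, "task_2": 75.0, "task_3": 60.0}
model_b = {"task_1": 88.0, "task_2": 75.0, "task_3": 71.0}
print(win_rate(model_a, model_b))  # (1 win + 0.5 tie) / 3 tasks -> 50.0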

📝 Notes: the default prompt does not emphasize efficiency requirements, as our work shows that such emphasis can degrade both efficiency and correctness for some weaker models. Models marked with "(⏩)" are evaluated with performance-encouraging prompts, since they may be able to accurately understand such requirements.

🏪 The detailed model generation data and results are available in our GitHub Pages repository.

💸 For the o1 model series we use 50 samples (half the usual number) to save cost; strong models also need fewer attempts to yield the desired number of correct samples.


Heatmap of Pairwise DPS Comparison

What's DPS? Differential Performance Score (DPS) is a LeetCode-inspired metric that reports the efficiency of LLM-generated code as a ranking percentile (0-100%). For example, "DPS=80" means the LLM's submissions can outperform or match 80% of LLM solutions.
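
A minimal sketch of the percentile idea, assuming instruction counts (lower is faster) as the cost measure; the full DPS in the paper additionally clusters reference solutions by performance, so this is illustrative rather than the exact formula.

def dps(candidate_cost: float, reference_costs: list[float]) -> float:
    # Share of reference solutions the candidate outperforms or matches.
    beaten_or_matched = sum(1 for c in reference_costs if candidate_cost <= c)
    return 100.0 * beaten_or_matched / len(reference_costs)

# Outperforming/matching 4 of 5 reference solutions yields DPS = 80.
print(dps(120.0, [100.0, 150.0, 150.0, 200.0, 300.0]))  # -> 80.0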

💡 Tips: hover the mouse over the heatmap to see the detailed DPS of the two compared models.

Adding and visualizing new model results?


git clone git@github.com:evalplus/evalplus.github.io.git
cd evalplus.github.io && git pull
cp ${PATH_TO}/${MODEL}_temp_1.0_evalperf_results.brief.json results/evalperf
python results/evalperf/stats.py && python -m http.server 8000
# Open the displayed address in your browser

          
@inproceedings{evalperf,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
          

We thank the OpenAI Researcher Access Program for providing part of the compute.