Evaluating LLMs for Efficient Code Generation
🚀 LLM-oriented code efficiency evaluation requires:
- Performance-exercising tasks & inputs -- otherwise "all complexities are equal when N is small" (see the sketch after this list)
- Meaningful compound metric -- average speedup does not fit multi-task evaluation
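To see why performance-exercising inputs matter, here is a toy, self-contained sketch (not EvalPerf code; function names and input sizes are illustrative): two functionally correct solutions of different complexity look equally fast on a small input and only diverge on a large one.

```python
# Toy illustration (not EvalPerf code): with small inputs, an O(n^2) and an
# O(n) solution are indistinguishable; only a performance-exercising input
# separates them.
import time

def has_duplicate_quadratic(xs):  # O(n^2): compare every pair
    return any(x == y for i, x in enumerate(xs) for y in xs[i + 1:])

def has_duplicate_linear(xs):     # O(n) expected: hash-set membership
    return len(set(xs)) != len(xs)

for n in (100, 3_000):            # small vs. performance-exercising input size
    data = list(range(n))         # worst case: no duplicates at all
    for fn in (has_duplicate_quadratic, has_duplicate_linear):
        start = time.perf_counter()
        fn(data)
        print(f"n={n:>5}  {fn.__name__:<25} {time.perf_counter() - start:.4f}s")
```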
🛍️ Based on our methodology, the EvalPerf dataset (current version 20240328) includes:
- 118 performance-exercising tasks
- Each task is equipped with a computationally challenging test input generated by the SaS generator
- Differential Performance Score (DPS): "DPS=80" means the submission outperforms or matches 80% of reference LLM solutions (see the sketch below)
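As a simplified sketch of the DPS idea (the official implementation additionally clusters reference solutions by performance level; see the paper), the score is the share of reference LLM solutions that the candidate matches or outperforms under a cost metric such as the instruction count:

```python
# Simplified DPS sketch. The real EvalPerf implementation clusters reference
# solutions into performance levels; here we simply compare raw costs
# (lower cost, e.g. fewer CPU instructions, is better).
def differential_performance_score(candidate_cost: float, reference_costs: list[float]) -> float:
    matched = sum(candidate_cost <= ref for ref in reference_costs)
    return 100.0 * matched / len(reference_costs)

# Hypothetical instruction counts: the candidate matches or beats 3 of 5 references.
print(differential_performance_score(1.2e9, [0.9e9, 1.1e9, 1.2e9, 2.0e9, 3.5e9]))  # 60.0
```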
🦾 The reliability of EvalPerf comes from:
- Correctness ablation: Pairwise comparison of LLMs' code efficiency over common passing tasks
- Anti-flakiness: (1) long computations -> low runtime variation (Paper Fig. 6); (2) the number of CPU instructions as the primitive metric (see the sketch below); & (3) DPS compares the given solution against reference solutions on the same test bed. -- These lead to low cross-platform variation (Paper Tab. 2)
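To make the "#instructions" point concrete, below is a minimal sketch of counting retired CPU instructions with the Linux `perf stat` CLI; it only illustrates the primitive metric and is not how EvalPlus profiles code internally.

```python
# Sketch only: count retired CPU instructions for a Python snippet via Linux
# `perf stat`. Instruction counts are far less noisy than wall-clock time,
# which is why they make a good primitive metric.
import subprocess

def count_instructions(python_snippet: str) -> int:
    # "-x ," switches perf to CSV output (written to stderr); the first CSV
    # field is the counter value and the third is the event name.
    proc = subprocess.run(
        ["perf", "stat", "-e", "instructions", "-x", ",",
         "python3", "-c", python_snippet],
        capture_output=True, text=True, check=True,
    )
    for line in proc.stderr.splitlines():
        fields = line.split(",")
        if len(fields) > 2 and fields[2].startswith("instructions"):
            return int(fields[0])
    raise RuntimeError("instruction counter not found -- is perf event access enabled?")

print(count_instructions("sum(range(10**6))"))
```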
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Allow unprivileged access to perf events (needed for instruction counting)
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
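Before running the evaluation, you can double-check that perf event access is permissive enough. A minimal check on a Linux host (the setup command above sets the value to 0):

```python
# Quick sanity check on Linux: EvalPerf's instruction counting relies on
# permissive perf event access; the setup command above sets this to 0.
with open("/proc/sys/kernel/perf_event_paranoid") as f:
    level = int(f.read().strip())
print(f"perf_event_paranoid = {level}" + ("" if level <= 0 else " (consider lowering it to 0)"))
```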
- 📊 Win-rate ranking -- each race round compares two models' DPS over their common passing tasks (see the sketch below this list)
- 🔥 Pairwise DPS in a heatmap -- DPS computed for each pair of compared models over their common passing tasks
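A hedged sketch of the pairwise win-rate computation (exact tie handling and aggregation in the official leaderboard scripts may differ; model names and scores below are made up):

```python
# Sketch of pairwise win-rate ranking: every pair of models is compared on the
# tasks both pass; the model with the higher mean DPS wins the round.
from itertools import combinations
from statistics import mean

def win_rates(dps_by_model: dict[str, dict[str, float]]) -> dict[str, float]:
    wins = {m: 0.0 for m in dps_by_model}
    rounds = {m: 0 for m in dps_by_model}
    for a, b in combinations(dps_by_model, 2):
        common = dps_by_model[a].keys() & dps_by_model[b].keys()  # common passing tasks
        if not common:
            continue
        rounds[a] += 1
        rounds[b] += 1
        dps_a = mean(dps_by_model[a][t] for t in common)
        dps_b = mean(dps_by_model[b][t] for t in common)
        if dps_a != dps_b:
            wins[a if dps_a > dps_b else b] += 1.0
        else:  # tie: half a win each
            wins[a] += 0.5
            wins[b] += 0.5
    return {m: 100.0 * wins[m] / rounds[m] for m in dps_by_model if rounds[m]}

# Hypothetical per-task DPS values for two models:
print(win_rates({
    "model-A": {"task1": 90.0, "task2": 70.0},
    "model-B": {"task1": 70.0, "task2": 80.0, "task3": 50.0},
}))  # {'model-A': 100.0, 'model-B': 0.0}
```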
Win-rate Leaderboard
📊 Ranking metrics: WR (Win-Rate; %) based on task- and model-wise competition (i.e., pairwise DPS).
📝 Notes: the default prompt does not emphasize efficiency requirements, as our work shows such emphasis can degrade both efficiency and correctness for weaker models. "(⏩)" marks models evaluated with performance-encouraging prompts, since stronger models may accurately understand such requirements.
🏪 The detailed model generation data and results are available in our pages repository (evalplus.github.io).
💸 We use 50 samples (half) for the o1 model series to save cost; strong models also need fewer attempts to produce the desired number of correct samples.
Heatmap of Pairwise DPS Comparison
What's DPS? Differential Performance Score (DPS) is a LeetCode-inspired metric that reports the overall code efficiency ranking percentile (0-100%) of LLM-generated code. For example, "DPS=80" means the LLM's submissions can outperform or match 80% of reference LLM solutions.
Adding and visualizing new model results?
git clone git@github.com:evalplus/evalplus.github.io.git
cd evalplus.github.io && git pull  # make sure the checkout is up to date
cp ${PATH_TO}/${MODEL}_temp_1.0_evalperf_results.brief.json results/evalperf  # add your model's brief EvalPerf results
python results/evalperf/stats.py && python -m http.server 8000
# Open the displayed address in your browser
🖊️ Citation
@inproceedings{evalperf,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
🤗 Acknowledgment
We thank the OpenAI Researcher Access Program for providing part of the compute.