πŸ† EvalPlus Leaderboard πŸ†

EvalPlus evaluates AI Coders with rigorous tests.

πŸ“’ News: Beyond correctness, how efficient is their code? Check out πŸš€ EvalPerf!

GitHub · Paper
| # | Model | pass@1 |
|--:|:------|-------:|
| 1 πŸ₯‡ | O1 Preview (Sept 2024) ✨ | 89 |
| 2 πŸ₯ˆ | O1 Mini (Sept 2024) ✨ | 89 |
| 3 πŸ₯‰ | Qwen2.5-Coder-32B-Instruct ✨ | 87.2 |
| 4 | GPT 4o (Aug 2024) ✨ | 87.2 |
| 5 | DeepSeek-V3 (Nov 2024) ✨ | 86.6 |
| 6 | GPT-4-Turbo (April 2024) ✨ | 86.6 |
| 7 | DeepSeek-V2.5 (Nov 2024) ✨ | 83.5 |
| 8 | GPT 4o Mini (July 2024) ✨ | 83.5 |
| 9 | DeepSeek-Coder-V2-Instruct ✨ | 82.3 |
| 10 | Claude Sonnet 3.5 (June 2024) ✨ | 81.7 |
| 11 | GPT-4-Turbo (Nov 2023) ✨ | 81.7 |
| 12 | Grok Beta ✨ | 80.5 |
| 13 | Gemini 1.5 Pro 002 ✨ | 79.3 |
| 14 | GPT-4 (May 2023) ✨ | 79.3 |
| 15 | CodeQwen1.5-7B-Chat ✨ | 78.7 |
| 16 | OpenCoder-8B-Instruct ✨ | 77.4 |
| 17 | claude-3-opus (Mar 2024) ✨ | 77.4 |
| 18 | Gemini 1.5 Flash 002 ✨ | 75.6 |
| 19 | DeepSeek-Coder-33B-instruct ✨ | 75 |
| 20 | Codestral-22B-v0.1 ✨ | 73.8 |
| 21 | OpenCodeInterpreter-DS-33B βœ¨πŸ’™ | 73.8 |
| 22 | WizardCoder-33B-V1.1 ✨ | 73.2 |
| 23 | Artigenz-Coder-DS-6.7B ✨ | 72.6 |
| 24 | Llama3-70B-instruct ✨ | 72 |
| 25 | Mixtral-8x22B-Instruct-v0.1 ✨ | 72 |
| 26 | OpenCodeInterpreter-DS-6.7B βœ¨πŸ’™ | 72 |
| 27 | speechless-codellama-34B-v2.0 βœ¨πŸ’™ | 72 |
| 28 | DeepSeek-Coder-6.7B-instruct ✨ | 71.3 |
| 29 | DeepSeek-Coder-7B-instruct-v1.5 ✨ | 71.3 |
| 30 | Magicoder-S-DS-6.7B βœ¨πŸ’™ | 71.3 |
| 31 | starchat2-15b-v0.1 βœ¨πŸ’š | 71.3 |
| 32 | GPT-3.5-Turbo (Nov 2023) ✨ | 70.7 |
| 33 | code-millenials-34B ✨ | 70.7 |
| 34 | databricks/dbrx-instruct ✨ | 70.1 |
| 35 | WaveCoder-Ultra-6.7B ✨ | 69.5 |
| 36 | XwinCoder-34B ✨ | 69.5 |
| 37 | claude-3-haiku (Mar 2024) ✨ | 68.9 |
| 38 | Magicoder-S-CL-7B βœ¨πŸ’™ | 67.7 |
| 39 | OpenChat-3.5-7B-0106 βœ¨πŸ’™ | 67.7 |
| 40 | Phind-CodeLlama-34B-v2 | 67.1 |
| 41 | GPT-3.5 (May 2023) ✨ | 66.5 |
| 42 | CodeLlama-70B-Instruct ✨ | 65.9 |
| 43 | WhiteRabbitNeo-33B-v1 ✨ | 65.9 |
| 44 | speechless-coder-ds-6.7B βœ¨πŸ’™ | 65.9 |
| 45 | WizardCoder-Python-34B-V1.0 ✨ | 64.6 |
| 46 | claude-3-sonnet (Mar 2024) ✨ | 64 |
| 47 | Llama3.1-8B-instruct ✨ | 62.8 |
| 48 | speechless-starcoder2-15b βœ¨πŸ’š | 62.8 |
| 49 | Mistral Large (Mar 2024) ✨ | 62.2 |
| 50 | claude-2 (Mar 2024) ✨ | 61.6 |
| 51 | Gemini Pro 1.5 ✨ | 61 |
| 52 | DeepSeek-Coder-1.3B-instruct ✨ | 60.4 |
| 53 | starcoder2-15b-instruct-v0.1 βœ¨πŸ’š | 60.4 |
| 54 | Code-290k-6.7B-Instruct βœ¨πŸ’™ | 59.7 |
| 55 | Qwen1.5-72B-Chat ✨ | 59.1 |
| 56 | Phi-3-mini-4k-instruct ✨ | 59.1 |
| 57 | dolphin-2.6-mixtral-8x7b βœ¨πŸ’™ | 57.3 |
| 58 | Command-R+ ✨ | 56.7 |
| 59 | Llama3-8B-instruct ✨ | 56.7 |
| 60 | Gemini Pro 1.0 ✨ | 55.5 |
| 61 | Code-13B βœ¨πŸ’™ | 52.4 |
| 62 | codegemma-7b-it ✨ | 51.8 |
| 63 | speechless-starcoder2-7b βœ¨πŸ’š | 51.8 |
| 64 | CodeLlama-70B | 50.6 |
| 65 | WizardCoder-15B-V1.0 ✨ | 50.6 |
| 66 | claude-instant-1 (Mar 2024) ✨ | 50.6 |
| 67 | speechless-coding-7B-16k-tora βœ¨πŸ’™ | 50.6 |
| 68 | Code-33B βœ¨πŸ’™ | 49.4 |
| 69 | OpenHermes-2.5-Code-290k-13B βœ¨πŸ’™ | 48.8 |
| 70 | CodeQwen1.5-7B | 45.7 |
| 71 | WizardCoder-Python-7B-V1.0 ✨ | 45.1 |
| 72 | phi-2-2.7B | 45.1 |
| 73 | DeepSeek-Coder-33B-base | 44.5 |
| 74 | CodeLlama-34B | 43.9 |
| 75 | Mistral-codealpaca-7B πŸ’™ | 42.1 |
| 76 | MistralHermes-CodePro-7B-v1 βœ¨πŸ’™ | 42.1 |
| 77 | codegemma-7b | 41.5 |
| 78 | speechless-code-mistral-7B-v1.0 βœ¨πŸ’™ | 41.5 |
| 79 | DeepSeek-Coder-6.7B-base | 39.6 |
| 80 | Mixtral-8x7B-Instruct-v0.1 ✨ | 39.6 |
| 81 | CodeLlama-13B | 38.4 |
| 82 | StarCoder2-15B πŸ’š | 37.8 |
| 83 | SOLAR-10.7B-Instruct-v1.0 βœ¨πŸ’™ | 37.2 |
| 84 | Mistral-7B-Instruct-v0.2 ✨ | 36 |
| 85 | CodeLlama-7B | 35.4 |
| 86 | gemma-1.1-7b-it ✨ | 35.4 |
| 87 | xDAN-L1-Chat-RL-v1-7B βœ¨πŸ’™ | 32.9 |
| 88 | Python-Code-13B βœ¨πŸ’™ | 30.5 |
| 89 | StarCoder2-7B πŸ’š | 29.9 |
| 90 | Llama3-8B-base | 29.3 |
| 91 | StarCoder-15B πŸ’š | 29.3 |
| 92 | gemma-7b | 28.7 |
| 93 | CodeGen-16B πŸ’š | 28 |
| 94 | StarCoder2-3B πŸ’š | 27.4 |
| 95 | CodeT5+-16B πŸ’š | 26.8 |
| 96 | CodeGen-6B πŸ’š | 25.6 |
| 97 | DeepSeek-Coder-1.3B-base | 25.6 |
| 98 | stable-code-3B πŸ’š | 25.6 |
| 99 | gemma-7b-it ✨ | 25 |
| 100 | CodeT5+-6B πŸ’š | 24.4 |
| 101 | Mistral-7B | 23.8 |
| 102 | Zephyr Ξ²-7B πŸ’™ | 23.2 |
| 103 | CodeGen-2B πŸ’š | 22.6 |
| 104 | CodeT5+-2B πŸ’š | 22 |
| 105 | StarCoderBase-7B πŸ’š | 21.3 |
| 106 | codegemma-2b | 20.7 |
| 107 | gemma-2b | 20.7 |
| 108 | CodeGen2-7B πŸ’š | 17.7 |
| 109 | gemma-1.1-2b-it ✨ | 17.7 |
| 110 | CodeGen2-16B πŸ’š | 16.5 |
| 111 | StarCoderBase-3B πŸ’š | 15.9 |
| 112 | Vicuna-13B πŸ’™ | 15.9 |
| 113 | gemma-2b-it ✨ | 15.2 |
| 114 | SantaCoder-1.1B πŸ’š | 14 |
| 115 | CodeGen2-3B πŸ’š | 12.8 |
| 116 | InCoder-6.7B πŸ’š | 12.2 |
| 117 | StarCoderBase-1B πŸ’š | 12.2 |
| 118 | Vicuna-7B πŸ’™ | 11.6 |
| 119 | GPT-J-6B πŸ’š | 11 |
| 120 | InCoder-1.3B πŸ’š | 11 |
| 121 | CodeGen2-1B πŸ’š | 9.1 |
| 122 | GPT-Neo-2.7B πŸ’š | 6.7 |
| 123 | PolyCoder-2.7B πŸ’š | 6.1 |
| 124 | StableLM-7B | 2.4 |
| 125 | zyte-1B βœ¨πŸ’™ | 1.8 |
| # | Model | pass@1 |
|--:|:------|-------:|
| 1 πŸ₯‡ | O1 Preview (Sept 2024) ✨ | 96.3 |
| 2 πŸ₯ˆ | O1 Mini (Sept 2024) ✨ | 96.3 |
| 3 πŸ₯‰ | GPT 4o (Aug 2024) ✨ | 92.7 |
| 4 | Qwen2.5-Coder-32B-Instruct ✨ | 92.1 |
| 5 | DeepSeek-V3 (Nov 2024) ✨ | 91.5 |
| 6 | DeepSeek-V2.5 (Nov 2024) ✨ | 90.2 |
| 7 | GPT-4-Turbo (April 2024) ✨ | 90.2 |
| 8 | Gemini 1.5 Pro 002 ✨ | 89 |
| 9 | Grok Beta ✨ | 88.4 |
| 10 | GPT 4o Mini (July 2024) ✨ | 88.4 |
| 11 | GPT-4 (May 2023) ✨ | 88.4 |
| 12 | Claude Sonnet 3.5 (June 2024) ✨ | 87.2 |
| 13 | DeepSeek-Coder-V2-Instruct ✨ | 85.4 |
| 14 | GPT-4-Turbo (Nov 2023) ✨ | 85.4 |
| 15 | CodeQwen1.5-7B-Chat ✨ | 83.5 |
| 16 | claude-3-opus (Mar 2024) ✨ | 82.9 |
| 17 | Gemini 1.5 Flash 002 ✨ | 82.3 |
| 18 | OpenCoder-8B-Instruct ✨ | 81.7 |
| 19 | DeepSeek-Coder-33B-instruct ✨ | 81.1 |
| 20 | Codestral-22B-v0.1 ✨ | 79.9 |
| 21 | WizardCoder-33B-V1.1 ✨ | 79.9 |
| 22 | OpenCodeInterpreter-DS-33B βœ¨πŸ’™ | 79.3 |
| 23 | Llama3-70B-instruct ✨ | 77.4 |
| 24 | OpenCodeInterpreter-DS-6.7B βœ¨πŸ’™ | 77.4 |
| 25 | speechless-codellama-34B-v2.0 βœ¨πŸ’™ | 77.4 |
| 26 | GPT-3.5-Turbo (Nov 2023) ✨ | 76.8 |
| 27 | Magicoder-S-DS-6.7B βœ¨πŸ’™ | 76.8 |
| 28 | claude-3-haiku (Mar 2024) ✨ | 76.8 |
| 29 | Mixtral-8x22B-Instruct-v0.1 ✨ | 76.2 |
| 30 | Artigenz-Coder-DS-6.7B ✨ | 75.6 |
| 31 | DeepSeek-Coder-7B-instruct-v1.5 ✨ | 75.6 |
| 32 | XwinCoder-34B ✨ | 75.6 |
| 33 | WaveCoder-Ultra-6.7B ✨ | 75 |
| 34 | databricks/dbrx-instruct ✨ | 75 |
| 35 | DeepSeek-Coder-6.7B-instruct ✨ | 74.4 |
| 36 | code-millenials-34B ✨ | 74.4 |
| 37 | starchat2-15b-v0.1 βœ¨πŸ’š | 73.8 |
| 38 | GPT-3.5 (May 2023) ✨ | 73.2 |
| 39 | WizardCoder-Python-34B-V1.0 ✨ | 73.2 |
| 40 | OpenChat-3.5-7B-0106 βœ¨πŸ’™ | 72.6 |
| 41 | CodeLlama-70B-Instruct ✨ | 72 |
| 42 | WhiteRabbitNeo-33B-v1 ✨ | 72 |
| 43 | Phind-CodeLlama-34B-v2 | 71.3 |
| 44 | speechless-coder-ds-6.7B βœ¨πŸ’™ | 71.3 |
| 45 | Magicoder-S-CL-7B βœ¨πŸ’™ | 70.7 |
| 46 | claude-3-sonnet (Mar 2024) ✨ | 70.7 |
| 47 | Llama3.1-8B-instruct ✨ | 69.5 |
| 48 | Mistral Large (Mar 2024) ✨ | 69.5 |
| 49 | claude-2 (Mar 2024) ✨ | 69.5 |
| 50 | Qwen1.5-72B-Chat ✨ | 68.3 |
| 51 | Gemini Pro 1.5 ✨ | 68.3 |
| 52 | starcoder2-15b-instruct-v0.1 βœ¨πŸ’š | 67.7 |
| 53 | speechless-starcoder2-15b βœ¨πŸ’š | 67.1 |
| 54 | DeepSeek-Coder-1.3B-instruct ✨ | 65.9 |
| 55 | Code-290k-6.7B-Instruct βœ¨πŸ’™ | 64.6 |
| 56 | Phi-3-mini-4k-instruct ✨ | 64.6 |
| 57 | Command-R+ ✨ | 64 |
| 58 | dolphin-2.6-mixtral-8x7b βœ¨πŸ’™ | 64 |
| 59 | Gemini Pro 1.0 ✨ | 63.4 |
| 60 | Llama3-8B-instruct ✨ | 61.6 |
| 61 | codegemma-7b-it ✨ | 60.4 |
| 62 | claude-instant-1 (Mar 2024) ✨ | 57.3 |
| 63 | WizardCoder-15B-V1.0 ✨ | 56.7 |
| 64 | Code-13B βœ¨πŸ’™ | 56.1 |
| 65 | speechless-starcoder2-7b βœ¨πŸ’š | 56.1 |
| 66 | CodeLlama-70B | 55.5 |
| 67 | Code-33B βœ¨πŸ’™ | 54.9 |
| 68 | speechless-coding-7B-16k-tora βœ¨πŸ’™ | 54.9 |
| 69 | OpenHermes-2.5-Code-290k-13B βœ¨πŸ’™ | 54.3 |
| 70 | CodeLlama-34B | 51.8 |
| 71 | CodeQwen1.5-7B | 51.8 |
| 72 | DeepSeek-Coder-33B-base | 51.2 |
| 73 | WizardCoder-Python-7B-V1.0 ✨ | 50.6 |
| 74 | phi-2-2.7B | 49.4 |
| 75 | Mistral-codealpaca-7B πŸ’™ | 48.2 |
| 76 | speechless-code-mistral-7B-v1.0 βœ¨πŸ’™ | 48.2 |
| 77 | DeepSeek-Coder-6.7B-base | 47.6 |
| 78 | MistralHermes-CodePro-7B-v1 βœ¨πŸ’™ | 47.6 |
| 79 | StarCoder2-15B πŸ’š | 46.3 |
| 80 | Mixtral-8x7B-Instruct-v0.1 ✨ | 45.1 |
| 81 | codegemma-7b | 44.5 |
| 82 | SOLAR-10.7B-Instruct-v1.0 βœ¨πŸ’™ | 43.3 |
| 83 | CodeLlama-13B | 42.7 |
| 84 | gemma-1.1-7b-it ✨ | 42.7 |
| 85 | Mistral-7B-Instruct-v0.2 ✨ | 42.1 |
| 86 | xDAN-L1-Chat-RL-v1-7B βœ¨πŸ’™ | 40.2 |
| 87 | CodeLlama-7B | 37.8 |
| 88 | StarCoder2-7B πŸ’š | 35.4 |
| 89 | gemma-7b | 35.4 |
| 90 | StarCoder-15B πŸ’š | 34.1 |
| 91 | Llama3-8B-base | 33.5 |
| 92 | CodeGen-16B πŸ’š | 32.9 |
| 93 | Python-Code-13B βœ¨πŸ’™ | 32.9 |
| 94 | CodeT5+-16B πŸ’š | 31.7 |
| 95 | StarCoder2-3B πŸ’š | 31.7 |
| 96 | Zephyr Ξ²-7B πŸ’™ | 30 |
| 97 | CodeGen-6B πŸ’š | 29.3 |
| 98 | CodeT5+-6B πŸ’š | 29.3 |
| 99 | stable-code-3B πŸ’š | 29.3 |
| 100 | DeepSeek-Coder-1.3B-base | 28.7 |
| 101 | Mistral-7B | 28.7 |
| 102 | gemma-7b-it ✨ | 28.7 |
| 103 | codegemma-2b | 26.8 |
| 104 | CodeT5+-2B πŸ’š | 25 |
| 105 | gemma-2b | 25 |
| 106 | CodeGen-2B πŸ’š | 24.4 |
| 107 | StarCoderBase-7B πŸ’š | 24.4 |
| 108 | gemma-1.1-2b-it ✨ | 22.6 |
| 109 | CodeGen2-16B πŸ’š | 19.5 |
| 110 | CodeGen2-7B πŸ’š | 18.3 |
| 111 | StarCoderBase-3B πŸ’š | 17.7 |
| 112 | gemma-2b-it ✨ | 17.7 |
| 113 | Vicuna-13B πŸ’™ | 17.1 |
| 114 | CodeGen2-3B πŸ’š | 15.9 |
| 115 | InCoder-6.7B πŸ’š | 15.9 |
| 116 | SantaCoder-1.1B πŸ’š | 14.6 |
| 117 | StarCoderBase-1B πŸ’š | 14.6 |
| 118 | GPT-J-6B πŸ’š | 12.2 |
| 119 | InCoder-1.3B πŸ’š | 12.2 |
| 120 | Vicuna-7B πŸ’™ | 11.6 |
| 121 | CodeGen2-1B πŸ’š | 11 |
| 122 | GPT-Neo-2.7B πŸ’š | 7.9 |
| 123 | PolyCoder-2.7B πŸ’š | 6.1 |
| 124 | StableLM-7B | 2.4 |
| 125 | zyte-1B βœ¨πŸ’™ | 2.4 |

πŸ“ Notes

  1. Evaluated using HumanEval+ version 0.1.10; MBPP+ version 0.2.0.
  2. Models are ranked according to pass@1 using greedy decoding. Setup details can be found here.
  3. ✨ marks models evaluated using a chat setting, while others perform direct code completion.
  4. Both MBPP and MBPP+ as referred to in our leaderboard use a subset (399 tasks) of hand-verified problems from MBPP-sanitized (427 tasks), to make sure each programming task is well-formed (e.g., the test_list is not wrong).
  5. Model providers are responsible for avoiding data contamination. Models trained on closed data may be affected by contamination.
  6. πŸ’š means open weights and open data. πŸ’™ means open weights and open SFT data, but the base model is not data-open. Why does this matter? πŸ’š and πŸ’™ models open-source their data, so one can concretely reason about contamination.
  7. "Size" here refers to the number of model weights activated during inference.

πŸ€— More Leaderboards

In addition to the EvalPlus leaderboards, we recommend building a comprehensive picture of LLM coding ability from a diverse set of benchmarks and leaderboards, such as: