MLE-bench

An eval for automated AI research

Performance Overview

ML-Master 2.0 (Deepseek-V3.2-Speciale): 56.4%
Leeroo (Gemini-3-Pro-Preview): 50.7%
Thesis (gpt-5-codex): 48.4%
CAIR MLE-STAR-Pro-1.5 (Gemini-2.5-Pro): 44.0%
FM Agent (Gemini-2.5-Pro): 43.6%
Operand ensemble (gpt-5 (low verbosity/effort)): 39.6%
CAIR MLE-STAR-Pro-1.0 (Gemini-2.5-Pro): 38.7%
InternAgent (deepseek-r1): 36.4%
R&D-Agent (gpt-5): 35.1%
Neo multi-agent (undisclosed): 34.2%
AIRA-dojo (o3): 31.6%
R&D-Agent (o3 + GPT-4.1): 30.2%
ML-Master (deepseek-r1): 29.3%
R&D-Agent (o1-preview): 22.4%
AIDE (o1-preview): 17.1%
AIDE (gpt-4o-2024-08-06): 8.6%
AIDE (claude-3-5-sonnet-20240620): 7.6%
OpenHands (gpt-4o-2024-08-06): 4.9%
AIDE (llama-3.1-405b-instruct): 3.3%
MLAB (gpt-4o-2024-08-06): 1.6%

Leaderboard

| Agent | LLM | Low/Lite (%) | Medium (%) | High (%) | Overall (%) | Time | Date | Reports | Source |
|---|---|---|---|---|---|---|---|---|---|
| ML-Master 2.0 | Deepseek-V3.2-Speciale | 75.76 ± 1.51 | 50.88 ± 3.51 | 42.22 ± 2.22 | 56.44 ± 2.47 | 24h | 2025-12-16 | Available | - |
| Leeroo | Gemini-3-Pro-Preview | 68.18 ± 2.62 | 44.74 ± 1.52 | 40.00 ± 0.00 | 50.67 ± 1.33 | 24h | 2025-12-07 | Available | - |
| Thesis | gpt-5-codex | 65.15 ± 1.52 | 45.61 ± 7.18 | 31.11 ± 2.22 | 48.44 ± 3.64 | 24h | 2025-11-10 | Available | - |
| CAIR MLE-STAR-Pro-1.5 | Gemini-2.5-Pro | 68.18 ± 2.62 | 34.21 ± 1.52 | 33.33 ± 0.00 | 44.00 ± 1.33 | 24h | 2025-11-25 | Available | - |
| FM Agent | Gemini-2.5-Pro | 62.12 ± 1.52 | 36.84 ± 1.52 | 33.33 ± 0.00 | 43.56 ± 0.89 | 24h | 2025-10-10 | Available | - |
| Operand ensemble | gpt-5 (low verbosity/effort) | 63.64 ± 0.00 | 33.33 ± 0.88 | 20.00 ± 0.00 | 39.56 ± 0.44 | 24h | 2025-10-06 | Available | - |
| CAIR MLE-STAR-Pro-1.0 | Gemini-2.5-Pro | 66.67 ± 1.52 | 25.44 ± 0.88 | 31.11 ± 2.22 | 38.67 ± 0.77 | 12h | 2025-11-03 | Available | - |
| InternAgent | deepseek-r1 | 62.12 ± 3.03 | 26.32 ± 2.63 | 24.44 ± 2.22 | 36.44 ± 1.18 | 12h | 2025-09-12 | Available | - |
| R&D-Agent | gpt-5 | 68.18 ± 2.62 | 21.05 ± 1.52 | 22.22 ± 2.22 | 35.11 ± 0.44 | 12h | 2025-09-26 | Available | Available |
| Neo multi-agent | undisclosed | 48.48 ± 1.52 | 29.82 ± 2.32 | 24.44 ± 2.22 | 34.22 ± 0.89 | 36h | 2025-07-28 | Available | - |
| AIRA-dojo | o3 | 55.00 ± 1.47 | 21.97 ± 1.17 | 21.67 ± 1.07 | 31.60 ± 0.82 | 24h | 2025-05-15 | Available | Available |
| R&D-Agent | o3 + GPT-4.1 | 51.52 ± 4.01 | 19.30 ± 3.16 | 26.67 ± 0.00 | 30.22 ± 0.89 | 24h | 2025-08-15 | Available | Available |
| ML-Master | deepseek-r1 | 48.48 ± 1.52 | 20.18 ± 2.32 | 24.44 ± 2.22 | 29.33 ± 0.77 | 12h | 2025-06-17 | Available | Available |
| R&D-Agent | o1-preview | 48.18 ± 1.11 | 8.95 ± 1.05 | 18.67 ± 1.33 | 22.40 ± 0.50 | 24h | 2025-05-14 | Available | Available |
| AIDE | o1-preview | 35.91 ± 1.86 | 8.45 ± 0.43 | 11.67 ± 1.27 | 17.12 ± 0.61 | 24h | 2024-10-08 | Available | Available |
| AIDE | gpt-4o-2024-08-06 | 18.55 ± 1.26 | 3.06 ± 0.33 | 8.15 ± 0.84 | 8.63 ± 0.54 | 24h | 2024-10-08 | Available | Available |
| AIDE | claude-3-5-sonnet-20240620 | 19.70 ± 1.52 | 2.63 ± 1.52 | 2.22 ± 2.22 | 7.56 ± 1.60 | 24h | 2024-10-08 | Available | Available |
| OpenHands | gpt-4o-2024-08-06 | 12.12 ± 1.52 | 1.75 ± 0.88 | 2.22 ± 2.22 | 4.89 ± 0.44 | 24h | 2024-10-08 | Available | Available |
| AIDE | llama-3.1-405b-instruct | 10.23 ± 1.14 | 0.66 ± 0.66 | 0.00 ± 0.00 | 3.33 ± 0.38 | 24h | 2024-10-08 | Available | Available |
| MLAB | gpt-4o-2024-08-06 | 4.55 ± 0.86 | 0.00 ± 0.00 | 0.00 ± 0.00 | 1.60 ± 0.27 | 24h | 2024-10-08 | Available | Available |
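
The Overall column appears to be a competition-weighted average of the Low/Lite, Medium, and High scores. The sketch below checks that reading against two leaderboard rows. The split sizes (22, 38, and 15 competitions) are an assumption, since this section does not state them, and `SPLIT_SIZES` and `overall_score` are hypothetical names used only for illustration.

```python
# Number of competitions per complexity split.
# ASSUMPTION: these counts are not stated in this section.
SPLIT_SIZES = {"low": 22, "medium": 38, "high": 15}


def overall_score(low: float, medium: float, high: float) -> float:
    """Return the competition-weighted mean of the three split scores (percent)."""
    scores = {"low": low, "medium": medium, "high": high}
    total = sum(SPLIT_SIZES.values())
    weighted = sum(scores[name] * SPLIT_SIZES[name] for name in SPLIT_SIZES)
    return weighted / total


if __name__ == "__main__":
    # Split scores copied from the ML-Master 2.0 and Leeroo rows.
    print(round(overall_score(75.76, 50.88, 42.22), 2))
    print(round(overall_score(68.18, 44.74, 40.00), 2))
```

Under these assumed weights, the results land within about 0.01 of the reported Overall means for both rows. Some other rows (for example AIDE with o1-preview) differ by a few hundredths, so treat this as an approximate consistency check rather than the leaderboard's exact aggregation.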