Model Leaderboard — MMLU Benchmark (100 samples)
Ranked by accuracy. Same prompts, same conditions, real API calls.
Note: These are sample results. Run the full framework with your own API keys for live benchmarks.
Full Leaderboard
Key Insights
| Finding | Detail |
|---|---|
| Best Accuracy | GPT-4o (88.2%) and Claude 3.5 Sonnet (87.6%) — nearly tied |
| Best Value | GPT-4o-mini — 78.4% accuracy at $0.0003/1K (27× cheaper than GPT-4o) |
| Fastest | Gemini 1.5 Flash — 380ms avg, $0.0001/1K (cheapest of all) |
| Best Reasoning | Claude 3.5 Sonnet — 8.6/10 reasoning quality score |
| Accuracy Gap | Only 10 percentage points separate best and worst; cost differs by 90× |
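
The ratios in the table can be sanity-checked with a few lines of arithmetic. The sketch below uses only the figures quoted above. Two values are not stated in the table and are back-calculated here as assumptions: the GPT-4o price implied by the 27× claim, and the most expensive model's price implied by the 90× claim. The gap is computed between GPT-4o and GPT-4o-mini only, since the lowest-scoring model is not named in this section.

```python
# Figures quoted in the Key Insights table.
gpt4o_accuracy = 88.2              # percent
gpt4o_mini_accuracy = 78.4         # percent
gpt4o_mini_cost_per_1k = 0.0003    # USD per 1K, as quoted
gemini_flash_cost_per_1k = 0.0001  # USD per 1K, as quoted (cheapest of all)

# Accuracy gap in percentage points between GPT-4o and GPT-4o-mini.
gap_points = round(gpt4o_accuracy - gpt4o_mini_accuracy, 1)

# Assumption: GPT-4o's price is not listed; this is what "27x cheaper" implies.
implied_gpt4o_cost_per_1k = round(gpt4o_mini_cost_per_1k * 27, 4)

# Assumption: the priciest model's cost is not listed; this is what a 90x
# spread over the cheapest model implies.
implied_max_cost_per_1k = round(gemini_flash_cost_per_1k * 90, 4)

print(f"Accuracy gap: {gap_points} points")
print(f"Implied GPT-4o cost: ${implied_gpt4o_cost_per_1k}/1K")
print(f"Implied highest cost: ${implied_max_cost_per_1k}/1K")
```

Read this way, moving from GPT-4o-mini to GPT-4o buys roughly ten points of accuracy at about 27 times the price.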
Run This Yourself

    pip install llm-evaluation-framework
    llm-eval compare \
      --models gpt-4o-mini \
      --models claude-3-haiku-20240307 \
      --models gemini/gemini-1.5-flash \
      --benchmark mmlu --samples 100

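
To compare a different set of models without editing the command by hand, the same invocation can be assembled in Python. This is a minimal sketch, not part of the framework: `build_compare_command` is a hypothetical helper, and it assumes the CLI accepts exactly the flags shown in the command above, with `--models` repeated once per model.

```python
import shlex


def build_compare_command(models, benchmark="mmlu", samples=100):
    """Return the argv list for `llm-eval compare`, repeating --models per model.

    Hypothetical helper: the flag names are copied from the command shown
    in this section, not taken from the framework's own documentation.
    """
    command = ["llm-eval", "compare"]
    for model in models:
        command += ["--models", model]
    command += ["--benchmark", benchmark, "--samples", str(samples)]
    return command


models = ["gpt-4o-mini", "claude-3-haiku-20240307", "gemini/gemini-1.5-flash"]
command = build_compare_command(models)

# Print the shell-equivalent command line for inspection.
print(shlex.join(command))

# To actually run it (requires the package installed and API keys configured):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the command as a list avoids shell-quoting problems if a model identifier contains unusual characters.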
Built by vignesh2027 · Star on GitHub · MIT License · Free forever