LLM Evaluation Framework

Benchmark GPT-4 · Claude · Gemini · Mistral · Llama — Side by Side

Accuracy · Latency · Cost · Hallucination Rate · Reasoning Quality

Model Leaderboard — MMLU Benchmark (100 samples)

Ranked by accuracy. Same prompts, same conditions, real API calls.

Note: The figures shown here are sample results. Run the full framework with your own API keys to produce live benchmarks.

Full Leaderboard

Key Insights

| Finding | Detail |
| --- | --- |
| Best Accuracy | GPT-4o (88.2%) and Claude 3.5 Sonnet (87.6%), nearly tied |
| Best Value | GPT-4o-mini: 78.4% accuracy at $0.0003/1K (27× cheaper than GPT-4o) |
| Fastest | Gemini 1.5 Flash: 380 ms average, $0.0001/1K (cheapest of all) |
| Best Reasoning | Claude 3.5 Sonnet: 8.6/10 reasoning quality score |
| Accuracy Gap | Only 10 percentage points separate the best and worst models, while cost differs by 90× |
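
The arithmetic behind these comparisons can be checked directly. The sketch below is illustrative only: it uses just the figures quoted above, the helper names `accuracy_gap` and `implied_price` are hypothetical and not part of the framework, and it assumes "27× cheaper" means GPT-4o's price is 27 times GPT-4o-mini's.

```python
# Illustrative arithmetic on the quoted sample figures.
# Helper names are hypothetical; they are not part of llm-evaluation-framework.

def accuracy_gap(best_acc: float, other_acc: float) -> tuple[float, float]:
    """Return (gap in percentage points, relative drop in percent)."""
    points = best_acc - other_acc
    relative = points / best_acc * 100
    return round(points, 1), round(relative, 1)


def implied_price(cheaper_price: float, times_cheaper: float) -> float:
    """Price of the costlier model, assuming 'N x cheaper' is a plain ratio."""
    return round(cheaper_price * times_cheaper, 6)


# Sample figures as quoted: accuracy in percent, price in dollars per 1K.
GPT4O_ACC, MINI_ACC = 88.2, 78.4
MINI_PRICE, FLASH_PRICE = 0.0003, 0.0001

points, relative = accuracy_gap(GPT4O_ACC, MINI_ACC)
gpt4o_price = implied_price(MINI_PRICE, 27)
flash_ratio = round(gpt4o_price / FLASH_PRICE)

print(f"GPT-4o vs GPT-4o-mini: {points} points ({relative}% relative)")
print(f"Implied GPT-4o price: ${gpt4o_price}/1K ({flash_ratio}x Gemini 1.5 Flash)")
```

With the quoted figures, GPT-4o-mini trails GPT-4o by 9.8 percentage points (about 11% in relative terms), and the implied GPT-4o price is $0.0081/1K, or 81× the Gemini 1.5 Flash price.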

Run This Yourself

pip install llm-evaluation-framework
llm-eval compare \
  --models gpt-4o-mini \
  --models claude-3-haiku-20240307 \
  --models gemini/gemini-1.5-flash \
  --benchmark mmlu --samples 100
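
To script the same comparison, for example to rerun it with different model lists, the command line can be assembled in Python. This is a minimal sketch that only builds the argument list using the flags shown above; `build_compare_command` is a hypothetical helper, not something the package is known to provide, and the sketch assumes `--models` is simply repeated once per model, as in the command above.

```python
import shlex

# Hypothetical helper: builds the argv for the `llm-eval compare` call shown
# above. It does not run anything and uses only the flags from that command.
def build_compare_command(models, benchmark="mmlu", samples=100):
    cmd = ["llm-eval", "compare"]
    for model in models:
        cmd += ["--models", model]  # one --models flag per model
    cmd += ["--benchmark", benchmark, "--samples", str(samples)]
    return cmd


cmd = build_compare_command([
    "gpt-4o-mini",
    "claude-3-haiku-20240307",
    "gemini/gemini-1.5-flash",
])

# Print the shell-equivalent command for inspection.
print(shlex.join(cmd))

# To actually execute it (requires the package and API keys to be set up):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` instead of a single string avoids shell quoting problems with model names that contain slashes or other special characters.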

Built by vignesh2027  ·  MIT License  ·  Free forever