LLM Evaluation Framework

Benchmark GPT-4 · Claude · Gemini · Mistral · Llama — Side by Side

Accuracy · Latency · Cost · Hallucination Rate · Reasoning Quality

Model Leaderboard — MMLU Benchmark (100 samples)

Ranked by accuracy. Same prompts, same conditions, real API calls.

Note: These are sample results. Run the full framework with your own API keys for live benchmarks.

Full Leaderboard

#	Model	Provider	Accuracy (%)	Latency (ms)	Cost / 1K ($)	Hallucination (%)	Reasoning / 10	Value Score
1	Claude 3.5 Sonnet	Anthropic	88.2	1240	0.0003	1.8	8.4	261333
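
The Value Score column is not defined on this page. The figure 261333 is consistent with accuracy divided by cost per 1K (78.4 / 0.0003, the GPT-4o-mini numbers quoted under Key Insights), so the sketch below assumes that definition. The function name `value_score` is hypothetical and is not taken from the framework's API.

```python
def value_score(accuracy_pct: float, cost_per_1k_usd: float) -> float:
    """Accuracy points per dollar of cost per 1K.

    ASSUMPTION: the leaderboard's Value Score is accuracy / cost.
    This is inferred from the published numbers, not from the framework's code.
    """
    if cost_per_1k_usd <= 0:
        raise ValueError("cost_per_1k_usd must be positive")
    return accuracy_pct / cost_per_1k_usd


# Figures quoted on this page for GPT-4o-mini: 78.4% accuracy at $0.0003 per 1K.
score = value_score(78.4, 0.0003)
print(int(score))  # truncated to a whole number: 261333
```

Under this assumed definition, a cheaper model with slightly lower accuracy can outrank a more accurate but far more expensive one, which matches the "Best Value" finding.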

Key Insights

Finding	Detail
Best Accuracy	GPT-4o (88.2%) and Claude 3.5 Sonnet (87.6%) — nearly tied
Best Value	GPT-4o-mini — 78.4% accuracy at $0.0003/1K (27× cheaper than GPT-4o)
Fastest	Gemini 1.5 Flash — 380ms avg, $0.0001/1K (cheapest of all)
Best Reasoning	Claude 3.5 Sonnet — 8.6/10 reasoning quality score
Accuracy Gap	Only about 10 percentage points separate the best and worst accuracy, while cost differs by 90×
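
The gap and ratio figures in the table above are simple arithmetic, sketched below using only numbers quoted on this page. The helper names are hypothetical, and treating 78.4% as the low end of the accuracy range is an assumption made for illustration, since the full leaderboard is not shown here.

```python
def accuracy_gap_points(best_pct: float, worst_pct: float) -> float:
    """Absolute gap in percentage points (not a relative percentage)."""
    return best_pct - worst_pct


def relative_gap_pct(best_pct: float, worst_pct: float) -> float:
    """Gap expressed as a percentage of the best score."""
    return (best_pct - worst_pct) / best_pct * 100


def cost_ratio(expensive_usd: float, cheap_usd: float) -> float:
    """How many times more expensive one model is than another."""
    return expensive_usd / cheap_usd


# 88.2% (GPT-4o) vs 78.4% (GPT-4o-mini), as quoted in Key Insights.
print(round(accuracy_gap_points(88.2, 78.4), 1))  # 9.8 percentage points
print(round(relative_gap_pct(88.2, 78.4), 1))     # 11.1 percent of the best score

# $0.0003/1K (GPT-4o-mini) vs $0.0001/1K (Gemini 1.5 Flash).
print(round(cost_ratio(0.0003, 0.0001), 1))       # 3.0
```

Keeping percentage points and relative percentages separate avoids overstating or understating the accuracy gap when it is compared against a cost multiple.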

Run This Yourself

pip install llm-evaluation-framework
llm-eval compare \
  --models gpt-4o-mini \
  --models claude-3-haiku-20240307 \
  --models gemini/gemini-1.5-flash \
  --benchmark mmlu --samples 100
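
To script the same comparison over different model lists, the command can be assembled in Python. This is a minimal sketch that uses only the subcommand and flags shown above (`compare`, `--models`, `--benchmark`, `--samples`); the helper `build_compare_command` is hypothetical and not part of the package.

```python
import shlex


def build_compare_command(models, benchmark="mmlu", samples=100):
    """Build the argument list for `llm-eval compare`.

    Only flags that appear in the command on this page are used;
    no other options are assumed to exist.
    """
    cmd = ["llm-eval", "compare"]
    for model in models:
        cmd += ["--models", model]  # the flag is repeated once per model
    cmd += ["--benchmark", benchmark, "--samples", str(samples)]
    return cmd


cmd = build_compare_command(
    ["gpt-4o-mini", "claude-3-haiku-20240307", "gemini/gemini-1.5-flash"]
)
print(shlex.join(cmd))
# To execute it (requires the package and API keys to be set up):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with model names that contain slashes or other special characters.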

Built by vignesh2027 · MIT License · Free forever