AI • December 2025 • 10 min read

LLM Benchmark Results 2025

A comparison of Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, Llama 3.1, and Mistral Large. We tested quality, coding ability, speed, reasoning, and cost to help you choose the right model.

By WorkChi Research Team • Updated monthly

TL;DR - Quick Recommendations

🏆 Best Overall: Claude 3.5 Sonnet
💰 Best Value: Mistral Large
🔓 Best Open Source: Llama 3.1 405B

Model Comparison

Model              Provider    Quality  Coding  Speed  Reasoning  Cost (in/out)
Claude 3.5 Sonnet  Anthropic   95       96      88     94         $3.00 / $15.00
GPT-4o             OpenAI      92       90      85     91         $5.00 / $15.00
Gemini 1.5 Pro     Google      89       85      90     88         $3.50 / $10.50
Llama 3.1 405B     Meta        88       86      75     87         Self-host
Mistral Large      Mistral AI  86       84      92     85         $2.00 / $6.00

Scores out of 100. Cost per 1M tokens (input/output).
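
To turn the per-1M-token prices into a budget figure, multiply input and output tokens by their respective rates and divide by one million. The sketch below does that arithmetic with the prices from the table; the function name `estimate_cost` and the 50M-input / 10M-output monthly workload are assumptions chosen for illustration, and Llama 3.1 405B is left out because its self-hosting cost depends on your own infrastructure.

```python
# USD per 1M tokens as (input, output), copied from the comparison table.
# Llama 3.1 405B is omitted: the table lists it as "Self-host" with no price.
PRICES_PER_MILLION = {
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "GPT-4o": (5.00, 15.00),
    "Gemini 1.5 Pro": (3.50, 10.50),
    "Mistral Large": (2.00, 6.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for the given token counts."""
    price_in, price_out = PRICES_PER_MILLION[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 50M input tokens and 10M output tokens per month.
    for model in PRICES_PER_MILLION:
        cost = estimate_cost(model, 50_000_000, 10_000_000)
        print(f"{model}: ${cost:,.2f}")
```

Output tokens cost three to five times as much as input tokens for every priced model here, so workloads that generate long responses shift the comparison more than the input rates alone suggest.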

Which Model Should You Use?

Customer Support: Claude 3.5 Sonnet. Best at understanding context and providing helpful, safe responses.
Code Generation: Claude 3.5 Sonnet. Highest coding benchmark scores, excellent at debugging.
Content Writing: GPT-4o. Natural writing style, good at matching brand voice.
Data Analysis: Claude 3.5 Sonnet. Strong reasoning, handles complex analytical tasks.
Budget-Conscious: Mistral Large. Best price-to-performance ratio for most tasks.
Long Documents: Gemini 1.5 Pro. 1M token context window, best for large documents.
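
One rough way to check a price-to-performance claim against the comparison table is to divide each model's quality score by a blended per-1M-token price. The sketch below is only an illustration: the metric itself, the helper names, and the assumed traffic mix of 75% input tokens to 25% output tokens are not from the article, and a different mix or a different score column could change the ordering.

```python
# (quality score, input USD per 1M tokens, output USD per 1M tokens),
# copied from the comparison table. Llama 3.1 405B has no listed price.
MODELS = {
    "Claude 3.5 Sonnet": (95, 3.00, 15.00),
    "GPT-4o": (92, 5.00, 15.00),
    "Gemini 1.5 Pro": (89, 3.50, 10.50),
    "Mistral Large": (86, 2.00, 6.00),
}


def blended_price(price_in: float, price_out: float, input_share: float = 0.75) -> float:
    """Weighted price per 1M tokens; input_share is an assumed traffic mix."""
    return price_in * input_share + price_out * (1 - input_share)


def quality_per_dollar(quality: float, price_in: float, price_out: float,
                       input_share: float = 0.75) -> float:
    """Quality points per blended dollar (an illustrative metric, not an official one)."""
    return quality / blended_price(price_in, price_out, input_share)


# Rank models from best to worst on the illustrative metric.
ranking = sorted(MODELS, key=lambda name: quality_per_dollar(*MODELS[name]), reverse=True)

if __name__ == "__main__":
    for name in ranking:
        print(f"{name}: {quality_per_dollar(*MODELS[name]):.2f} quality points per blended dollar")
```

Under these assumptions Mistral Large comes out first, which matches the Budget-Conscious recommendation above.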

Benchmark Scores

MMLU (Knowledge)

Claude 3.5     89.3%
GPT-4o         88.7%
Gemini 1.5     85.9%
Llama 3.1      85.5%
Mistral Large  81.2%

HumanEval (Coding)

Claude 3.5     92%
GPT-4o         90.2%
Gemini 1.5     84.1%
Llama 3.1      84%
Mistral Large  81.1%

MATH (Reasoning)

Claude 3.5     71.1%
GPT-4o         68.4%
Gemini 1.5     67.7%
Llama 3.1      66.2%
Mistral Large  58.3%

GSM8K (Math)

Claude 3.5     96.4%
GPT-4o         95.3%
Gemini 1.5     94.4%
Llama 3.1      93.1%
Mistral Large  91.2%

Access All Models via WorkChi

WorkChi's AI Gateway provides unified access to Claude, GPT-4, Gemini, Llama, and Mistral through a single API. 100% EU-hosted for GDPR compliance.

