LLM Benchmark – May 2025 Update


Which AI model handles real-world coding tasks best today?

In March 2025, we published our first AI benchmark comparison, examining how top large language models perform in real-life programming tasks. Now it’s time for an update.


We re-ran the same test scenarios using the latest versions of OpenAI, Anthropic, Google Gemini, and DeepSeek models. The goal? To evaluate their ability to solve typical developer problems, not just generate nice-sounding code.

Below are the updated results for May 2025.

The Methodology

We gave each model three practical tasks, focusing on different areas of software development:

  1. Implementing BFS in JavaScript
    Find the shortest path in a maze using Breadth-First Search.
    Focus: Algorithm logic and problem solving.

  2. Creating a FastAPI server with JWT auth (Python)
    Develop a REST API with login, token generation, and a protected endpoint.
    Focus: Authentication flow and security practices.

  3. Optimizing large datasets in MySQL using Laravel (PHP)
    Write an efficient query to find the top 5 most active users over the past 6 months.
    Focus: Performance, indexing, and Laravel conventions.

Each solution was scored in three categories:

  • Correctness (was the task completed properly?)

  • Flexibility (does it handle edge cases?)

  • Code quality (is it efficient and readable?)

Task 1: JavaScript Maze Solver

| Model     | Correctness | Test Handling | Code Quality | Overall |
|-----------|-------------|---------------|--------------|---------|
| Gemini    | 90/100      | 80/100        | 85/100       | 85      |
| OpenAI    | 90/100      | 80/100        | 90/100       | 87      |
| Anthropic | 95/100      | 90/100        | 95/100       | 93      |
| DeepSeek  | 90/100      | 85/100        | 90/100       | 88      |

Top performer: Anthropic.
Anthropic provided the most accurate and elegant implementation, excelling in both structure and clarity.
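For reference, here is a minimal sketch of what a correct Task 1 solution looks like: a queue-based BFS over a grid, expanding level by level so that the first time the goal is dequeued, the distance is guaranteed minimal. The maze encoding (0 = open, 1 = wall) and all names are our own illustration, not taken from any model's answer.

```javascript
// BFS shortest path in a grid maze. 0 = open cell, 1 = wall.
// Returns the number of steps from start to goal, or -1 if unreachable.
function shortestPath(maze, start, goal) {
  const rows = maze.length, cols = maze[0].length;
  const key = ([r, c]) => r * cols + c;       // flatten coordinates for the visited set
  const visited = new Set([key(start)]);
  let frontier = [[...start, 0]];             // entries are [row, col, distance]
  const moves = [[1, 0], [-1, 0], [0, 1], [0, -1]];

  while (frontier.length > 0) {
    const next = [];
    for (const [r, c, d] of frontier) {
      if (r === goal[0] && c === goal[1]) return d;  // first arrival = shortest
      for (const [dr, dc] of moves) {
        const nr = r + dr, nc = c + dc;
        if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) continue;
        if (maze[nr][nc] === 1 || visited.has(key([nr, nc]))) continue;
        visited.add(key([nr, nc]));           // mark on enqueue to avoid duplicates
        next.push([nr, nc, d + 1]);
      }
    }
    frontier = next;                          // advance one BFS level
  }
  return -1;
}

// e.g. shortestPath([[0, 0, 1], [1, 0, 1], [1, 0, 0]], [0, 0], [2, 2]) → 4
```

Marking cells as visited at enqueue time (rather than dequeue time) is the detail that separates clean submissions from ones that blow up on open mazes.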

Task 2: FastAPI with JWT Auth

| Model     | API Logic | Auth Handling | Security | Overall |
|-----------|-----------|---------------|----------|---------|
| Gemini    | 95/100    | 90/100        | 85/100   | 90      |
| OpenAI    | 95/100    | 85/100        | 70/100   | 83      |
| Anthropic | 95/100    | 85/100        | 75/100   | 85      |
| DeepSeek  | 95/100    | 85/100        | 70/100   | 83      |

Top performer: Gemini.
Gemini demonstrated excellent understanding of the FastAPI and JWT flow. OpenAI’s result was functional, but with weaker security defaults.

Task 3: MySQL Optimization in Laravel

| Model     | Indexing | Efficiency | Best Practices | Overall |
|-----------|----------|------------|----------------|---------|
| Gemini    | 90/100   | 95/100     | 85/100         | 90      |
| OpenAI    | 90/100   | 95/100     | 85/100         | 90      |
| Anthropic | 90/100   | 95/100     | 85/100         | 90      |
| DeepSeek  | 95/100   | 95/100     | 90/100         | 93      |

Top performer: DeepSeek.
It stood out for proactively applying indexing and writing production-grade code, while following Laravel’s structure closely.
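Laravel's query builder aside, the underlying pattern the task rewards is the same everywhere: filter a time window, aggregate per user, take the top N, and back it with a composite index. A minimal sketch in plain SQL, run here through Python's built-in sqlite3 as a stand-in for MySQL (schema, names, and seed data are illustrative, not from the benchmark):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE activities (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        created_at TEXT NOT NULL
    );
    -- The index the task rewards: covers both the date filter and the grouping
    CREATE INDEX idx_activities_user_created ON activities (user_id, created_at);

    INSERT INTO users (id, name) VALUES (1, 'ada'), (2, 'linus'), (3, 'grace');
    INSERT INTO activities (user_id, created_at) VALUES
        (1, datetime('now', '-1 day')),
        (1, datetime('now', '-2 days')),
        (1, datetime('now', '-3 days')),
        (2, datetime('now', '-7 months')),  -- outside the 6-month window
        (2, datetime('now', '-1 day')),
        (3, datetime('now', '-10 days')),
        (3, datetime('now', '-20 days'));
""")

# Top 5 most active users over the past 6 months
top_users = conn.execute("""
    SELECT u.name, COUNT(*) AS actions
    FROM activities AS a
    JOIN users AS u ON u.id = a.user_id
    WHERE a.created_at >= datetime('now', '-6 months')
    GROUP BY a.user_id
    ORDER BY actions DESC
    LIMIT 5
""").fetchall()
```

Pushing the filtering, grouping, and limiting into a single indexed query, instead of pulling rows into PHP and counting there, is what separated the stronger answers on large datasets.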

Final Scores – May 2025

| Model     | Average Score |
|-----------|---------------|
| Anthropic | 89.3          |
| DeepSeek  | 88.0          |
| OpenAI    | 86.7          |
| Gemini    | 88.3          |

Anthropic edges ahead this round, averaging 89.3 across its three task scores (93, 85, and 90), but the differences between all four models are very narrow. Each has its strengths: Anthropic continues to lead in algorithmic clarity, DeepSeek excels in backend logic, and Gemini's practical implementations are consistently strong.

What’s Next?

The gap between top LLMs is closing fast. As models continue to evolve, regular benchmarking helps track which one is best suited to support developers in real-world applications.

We’ll continue to monitor their progress and publish updates in the future. If you’d like us to add other types of tasks—such as frontend development or DevOps automation—let us know.


Jakub Wachol
Back-end Developer
