Which AI model handles real-world coding tasks best today?
In March 2025, we published our first AI benchmark comparison, examining how top large language models perform on real-life programming tasks. Now it’s time for an update.

We re-ran the same test scenarios using the latest versions of the OpenAI, Anthropic, Google Gemini, and DeepSeek models. The goal? To evaluate their ability to solve typical developer problems, not just generate plausible-looking code.
Below are the updated results for May 2025.
The Methodology
We gave each model three practical tasks, focusing on different areas of software development:
- Implementing BFS in JavaScript: find the shortest path in a maze using Breadth-First Search. Focus: algorithm logic and problem solving.
- Creating a FastAPI server with JWT auth (Python): develop a REST API with login, token generation, and a protected endpoint. Focus: authentication flow and security practices.
- Optimizing large datasets in MySQL using Laravel (PHP): write an efficient query to find the top 5 most active users over the past 6 months. Focus: performance, indexing, and Laravel conventions.
Each solution was scored in three categories:
- Correctness (was the task completed properly?)
- Flexibility (does it handle edge cases?)
- Code quality (is it efficient and readable?)
Task 1: JavaScript Maze Solver
| Model | Correctness | Test Handling | Code Quality | Overall |
|---|---|---|---|---|
| Gemini | 90/100 | 80/100 | 85/100 | 85 |
| OpenAI | 90/100 | 80/100 | 90/100 | 87 |
| Anthropic | 95/100 | 90/100 | 95/100 | 93 |
| DeepSeek | 90/100 | 85/100 | 90/100 | 88 |
Top performer: Anthropic.
Anthropic provided the most accurate and elegant implementation, excelling in both structure and clarity.
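For reference, the heart of this task is a standard breadth-first search over the grid. A minimal sketch follows, written in Python for consistency with the other examples here even though the models were asked for JavaScript; the maze encoding (0 for open cells, 1 for walls) and the function name are our assumptions:

```python
from collections import deque

def shortest_path(maze, start, goal):
    """Return the shortest path from start to goal as a list of (row, col)
    cells, or None if the goal is unreachable. Cells with value 1 are walls."""
    rows, cols = len(maze), len(maze[0])
    queue = deque([start])
    # Remember each cell's predecessor so the path can be rebuilt at the end;
    # the dict doubles as the visited set.
    parents = {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] == 0 and (nr, nc) not in parents):
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None
```

Because BFS explores cells in order of distance from the start, the first time the goal is dequeued the reconstructed path is guaranteed to be a shortest one.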
Task 2: FastAPI with JWT Auth
| Model | API Logic | Auth Handling | Security | Overall |
|---|---|---|---|---|
| Gemini | 95/100 | 90/100 | 85/100 | 90 |
| OpenAI | 95/100 | 85/100 | 70/100 | 83 |
| Anthropic | 95/100 | 85/100 | 75/100 | 85 |
| DeepSeek | 95/100 | 85/100 | 70/100 | 83 |
Top performer: Gemini.
Gemini demonstrated an excellent understanding of FastAPI and the JWT flow. OpenAI’s result was functional, but shipped with weaker security defaults.
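What separates the stronger submissions here is getting the token mechanics right. The following stdlib-only sketch shows the HS256 signing and verification that a FastAPI auth flow builds on; in a real application you would use a maintained library such as PyJWT, and the function names below are illustrative, not part of any framework:

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url(data: bytes) -> str:
    # JWTs use unpadded URL-safe base64.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_token(username: str, secret: str, ttl: int = 3600) -> str:
    """Issue a signed HS256 token with subject and expiry claims."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(
        {"sub": username, "exp": int(time.time()) + ttl}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64url(hmac.new(secret.encode(), signing_input,
                           hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_token(token: str, secret: str):
    """Return the claims dict if signature and expiry check out, else None."""
    try:
        header, payload, sig = token.split(".")
    except ValueError:
        return None
    signing_input = f"{header}.{payload}".encode()
    expected = _b64url(hmac.new(secret.encode(), signing_input,
                                hashlib.sha256).digest())
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(
        base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims.get("exp", 0) < time.time():
        return None
    return claims
```

The “weaker security defaults” the scores penalized are exactly the details this sketch has to get right: constant-time signature comparison and an enforced expiry claim.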
Task 3: MySQL Optimization in Laravel
| Model | Indexing | Efficiency | Best Practices | Overall |
|---|---|---|---|---|
| Gemini | 90/100 | 95/100 | 85/100 | 90 |
| OpenAI | 90/100 | 95/100 | 85/100 | 90 |
| Anthropic | 90/100 | 95/100 | 85/100 | 90 |
| DeepSeek | 95/100 | 95/100 | 90/100 | 93 |
Top performer: DeepSeek.
It stood out for proactively applying indexing and writing production-grade code, while following Laravel’s structure closely.
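The query shape this task calls for can be sketched as follows. We use SQLite here so the example is self-contained; the table name, schema, and index are assumptions, and a Laravel solution would express the same query through Eloquent or the query builder (`groupBy`, `orderByDesc`, `limit`) against MySQL:

```python
import sqlite3
from datetime import datetime, timedelta

# Illustrative schema: an `activities` table with one row per user action.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activities (
        user_id    INTEGER NOT NULL,
        created_at TEXT    NOT NULL
    );
    -- Covering index: the date range scan and the per-user counts can be
    -- answered from the index alone, without touching the table rows.
    CREATE INDEX idx_activities_created_user
        ON activities (created_at, user_id);
""")

cutoff = (datetime.now() - timedelta(days=180)).strftime("%Y-%m-%d %H:%M:%S")

def top_active_users(conn, cutoff, limit=5):
    """Top `limit` users by activity count since `cutoff`."""
    return conn.execute(
        """
        SELECT user_id, COUNT(*) AS actions
        FROM activities
        WHERE created_at >= ?
        GROUP BY user_id
        ORDER BY actions DESC
        LIMIT ?
        """,
        (cutoff, limit),
    ).fetchall()
```

The proactive indexing the scores rewarded is the `CREATE INDEX` step: without it, the date filter forces a full table scan on a large dataset.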
Final Scores – May 2025
| Model | Average Score |
|---|---|
| Anthropic | 89.3 |
| Gemini | 88.3 |
| DeepSeek | 88.0 |
| OpenAI | 86.7 |
Averages are the mean of each model’s three task scores. Although Anthropic edges ahead this round, the differences between all four models are very narrow. Each has its strengths: Anthropic continues to lead in algorithmic clarity, DeepSeek excels in backend logic, while Gemini’s practical implementation is consistently strong.
What’s Next?
The gap between top LLMs is closing fast. As models continue to evolve, regular benchmarking helps track which one is best suited to support developers in real-world applications.
We’ll continue to monitor their progress and publish updates in the future. If you’d like us to add other types of tasks—such as frontend development or DevOps automation—let us know.