LLM Benchmark – May 2025 Update


Which AI model handles real-world coding tasks best today?

In March 2025, we published our first AI benchmark comparison, examining how top large language models perform in real-life programming tasks. Now it’s time for an update.


We re-ran the same test scenarios using the latest versions of OpenAI, Anthropic, Google Gemini, and DeepSeek models. The goal? To evaluate their ability to solve typical developer problems, not just generate nice-sounding code.

Below are the updated results for May 2025.

The Methodology

We gave each model three practical tasks, focusing on different areas of software development:

  1. Implementing BFS in JavaScript
    Find the shortest path in a maze using Breadth-First Search.
    Focus: Algorithm logic and problem solving.

  2. Creating a FastAPI server with JWT auth (Python)
    Develop a REST API with login, token generation, and a protected endpoint.
    Focus: Authentication flow and security practices.

  3. Optimizing large datasets in MySQL using Laravel (PHP)
    Write an efficient query to find the top 5 most active users over the past 6 months.
    Focus: Performance, indexing, and Laravel conventions.

Each solution was scored in three categories:

  • Correctness (was the task completed properly?)

  • Flexibility (does it handle edge cases?)

  • Code quality (is it efficient and readable?)

Task 1: JavaScript Maze Solver

| Model     | Correctness | Test Handling | Code Quality | Overall |
|-----------|-------------|---------------|--------------|---------|
| Gemini    | 90/100      | 80/100        | 85/100       | 85      |
| OpenAI    | 90/100      | 80/100        | 90/100       | 87      |
| Anthropic | 95/100      | 90/100        | 95/100       | 93      |
| DeepSeek  | 90/100      | 85/100        | 90/100       | 88      |

Top performer: Anthropic.
Anthropic provided the most accurate and elegant implementation, excelling in both structure and clarity.
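For reference, here is a minimal sketch of what a correct Task 1 solution looks like: a queue-based BFS over a grid, expanding level by level so that the first time the goal is dequeued, the distance is guaranteed minimal. The maze encoding (0 = open, 1 = wall) and all names are our own illustration, not taken from any model's answer.

```javascript
// BFS shortest path in a grid maze. 0 = open cell, 1 = wall.
// Returns the number of steps from start to goal, or -1 if unreachable.
function shortestPath(maze, start, goal) {
  const rows = maze.length, cols = maze[0].length;
  const key = ([r, c]) => r * cols + c;       // flatten coordinates for the visited set
  const visited = new Set([key(start)]);
  let frontier = [[...start, 0]];             // entries are [row, col, distance]
  const moves = [[1, 0], [-1, 0], [0, 1], [0, -1]];

  while (frontier.length > 0) {
    const next = [];
    for (const [r, c, d] of frontier) {
      if (r === goal[0] && c === goal[1]) return d;  // first arrival = shortest
      for (const [dr, dc] of moves) {
        const nr = r + dr, nc = c + dc;
        if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) continue;
        if (maze[nr][nc] === 1 || visited.has(key([nr, nc]))) continue;
        visited.add(key([nr, nc]));           // mark on enqueue to avoid duplicates
        next.push([nr, nc, d + 1]);
      }
    }
    frontier = next;                          // advance one BFS level
  }
  return -1;
}

// e.g. shortestPath([[0, 0, 1], [1, 0, 1], [1, 0, 0]], [0, 0], [2, 2]) → 4
```

Marking cells as visited at enqueue time (rather than dequeue time) is the detail that separates clean submissions from ones that blow up on open mazes.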

Task 2: FastAPI with JWT Auth

| Model     | API Logic | Auth Handling | Security | Overall |
|-----------|-----------|---------------|----------|---------|
| Gemini    | 95/100    | 90/100        | 85/100   | 90      |
| OpenAI    | 95/100    | 85/100        | 70/100   | 83      |
| Anthropic | 95/100    | 85/100        | 75/100   | 85      |
| DeepSeek  | 95/100    | 85/100        | 70/100   | 83      |

Top performer: Gemini.
Gemini demonstrated excellent understanding of the FastAPI and JWT flow. OpenAI’s result was functional, but with weaker security defaults.

Task 3: MySQL Optimization in Laravel

| Model     | Indexing | Efficiency | Best Practices | Overall |
|-----------|----------|------------|----------------|---------|
| Gemini    | 90/100   | 95/100     | 85/100         | 90      |
| OpenAI    | 90/100   | 95/100     | 85/100         | 90      |
| Anthropic | 90/100   | 95/100     | 85/100         | 90      |
| DeepSeek  | 95/100   | 95/100     | 90/100         | 93      |

Top performer: DeepSeek.
It stood out for proactively applying indexing and writing production-grade code, while following Laravel’s structure closely.
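Laravel's query builder aside, the underlying pattern the task rewards is the same everywhere: filter a time window, aggregate per user, take the top N, and back it with a composite index. A minimal sketch in plain SQL, run here through Python's built-in sqlite3 as a stand-in for MySQL (schema, names, and seed data are illustrative, not from the benchmark):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE activities (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        created_at TEXT NOT NULL
    );
    -- The index the task rewards: covers both the date filter and the grouping
    CREATE INDEX idx_activities_user_created ON activities (user_id, created_at);

    INSERT INTO users (id, name) VALUES (1, 'ada'), (2, 'linus'), (3, 'grace');
    INSERT INTO activities (user_id, created_at) VALUES
        (1, datetime('now', '-1 day')),
        (1, datetime('now', '-2 days')),
        (1, datetime('now', '-3 days')),
        (2, datetime('now', '-7 months')),  -- outside the 6-month window
        (2, datetime('now', '-1 day')),
        (3, datetime('now', '-10 days')),
        (3, datetime('now', '-20 days'));
""")

# Top 5 most active users over the past 6 months
top_users = conn.execute("""
    SELECT u.name, COUNT(*) AS actions
    FROM activities AS a
    JOIN users AS u ON u.id = a.user_id
    WHERE a.created_at >= datetime('now', '-6 months')
    GROUP BY a.user_id
    ORDER BY actions DESC
    LIMIT 5
""").fetchall()
```

Pushing the filtering, grouping, and limiting into a single indexed query, instead of pulling rows into PHP and counting there, is what separated the stronger answers on large datasets.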

Final Scores – May 2025

| Model     | Average Score |
|-----------|---------------|
| Anthropic | 89.3          |
| DeepSeek  | 88.0          |
| OpenAI    | 86.7          |
| Gemini    | 88.3          |

Anthropic edges ahead this round, averaging 89.3 across its three task scores (93, 85, and 90), but the differences between all four models are very narrow. Each has its strengths: Anthropic continues to lead in algorithmic clarity, DeepSeek excels in backend logic, and Gemini's practical implementations are consistently strong.

What’s Next?

The gap between top LLMs is closing fast. As models continue to evolve, regular benchmarking helps track which one is best suited to support developers in real-world applications.

We’ll continue to monitor their progress and publish updates in the future. If you’d like us to add other types of tasks—such as frontend development or DevOps automation—let us know.


Jakub Wachol
Back-end Developer
