Evaluating the Performance of Leading AI Models in Programming Tasks

Artificial intelligence is playing an increasingly significant role in the IT industry, assisting developers in daily tasks such as code generation and database query optimization.

To assess the effectiveness of the latest large language models (LLMs), we benchmarked models from four leading providers: Google (Gemini), OpenAI, Anthropic, and DeepSeek.

We analyzed their performance in algorithm implementation, API development, and database optimization. How did each of them perform? Check the results below!

Research Methodology

To ensure a reliable and objective evaluation, each model was tested on the same programming tasks, covering various aspects of software engineering: algorithm implementation, API development, and database optimization. 

The models received identical prompts and were assessed according to predefined metrics, such as solution correctness, code readability, and adherence to best programming practices. 

In each category, scores were assigned on a scale from 0 to 100, and the final score for each model was calculated as the average of its category scores.

Algorithm Implementation: Shortest Path in a Maze (JavaScript, BFS)

Task: Implement a function in JavaScript that finds the shortest path in a maze using the BFS algorithm.

Evaluation Metrics:

  • Did the model correctly implement the algorithm?

  • Does the code work for various test cases?

  • Is the code readable and optimized?

Results:

  • Gemini – 94/100

  • OpenAI – 96/100

  • Anthropic – 97/100

  • DeepSeek – 95/100

All four models implemented the BFS algorithm correctly, with scores within three points of one another. OpenAI and Anthropic stood out for code readability, while DeepSeek lost a few points on optimization.
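
For readers unfamiliar with the task, here is a minimal sketch of BFS shortest-path search on a grid maze. It is written in Python rather than the JavaScript used in the benchmark, and it is not the output of any tested model. The maze encoding (0 for an open cell, 1 for a wall), the four-direction movement, and the names `shortest_path` and `MAZE` are assumptions made for illustration.

```python
from collections import deque


def shortest_path(maze, start, goal):
    """Return the shortest path as a list of (row, col) cells, or None if unreachable."""
    rows, cols = len(maze), len(maze[0])
    queue = deque([start])
    parent = {start: None}  # also serves as the visited set

    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]

        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and maze[nxt[0]][nxt[1]] == 0 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None


# Hypothetical test maze: 0 = open cell, 1 = wall.
MAZE = [
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
]

path = shortest_path(MAZE, (0, 0), (0, 3))
print(len(path) - 1)  # number of moves on the shortest path
```

Because BFS explores cells in order of distance from the start, the first time it reaches the goal it has found a path with the fewest moves. The sketch assumes the start cell itself is open.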

Creating an API Server with JWT Authentication (Python - FastAPI)

Task: Develop a REST API server in FastAPI that handles user login and JWT authentication.

Evaluation Metrics:

  • Does the API function as required?

  • Is JWT authentication correctly implemented?

  • Is the code secure and aligned with best practices?

Results:

  • Gemini – 96/100

  • OpenAI – 95/100

  • Anthropic – 98/100

  • DeepSeek – 87/100

OpenAI and Anthropic demonstrated strong authentication approaches, whereas DeepSeek scored lower due to weaker security implementation.
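
To make the "is JWT authentication correctly implemented" criterion more concrete, the sketch below shows the core of issuing and verifying an HS256-signed token using only the Python standard library. It is an illustration, not the FastAPI code produced by any model, and a real service would normally rely on a maintained JWT library. The secret, the 15-minute lifetime, and the function names `create_token` and `verify_token` are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret-change-me"  # hypothetical; load from configuration in practice


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str) -> bytes:
    return hmac.new(SECRET, signing_input.encode("ascii"), hashlib.sha256).digest()


def create_token(username, ttl_seconds=900, now=None):
    """Issue a token with a subject and an expiry claim."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": username, "exp": now + ttl_seconds}
    parts = [
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    ]
    signing_input = ".".join(parts)
    return signing_input + "." + _b64url(_sign(signing_input))


def verify_token(token, now=None):
    """Return the payload if the signature and expiry are valid, else None."""
    now = int(time.time()) if now is None else now
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        expected = _sign(header_b64 + "." + payload_b64)
        # Constant-time comparison avoids leaking signature bytes via timing.
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(payload_b64))
    except ValueError:
        return None
    # Reject unexpected algorithms and expired tokens.
    if header.get("alg") != "HS256" or payload.get("exp", 0) <= now:
        return None
    return payload
```

The points a reviewer would typically look for are visible here: the signature is checked before the payload is trusted, the algorithm is pinned rather than read from the token, and expired tokens are rejected.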

Optimizing Large Datasets in MySQL (PHP - Laravel)

Task: Optimize Laravel queries to handle millions of records and identify the five customers with the most orders in the past six months.

Evaluation Metrics:

  • Did the model use indexing (INDEX)?

  • Is the query optimized for performance?

  • Is the code aligned with Laravel best practices?

Results:

  • Gemini – 88/100

  • OpenAI – 94/100

  • Anthropic – 97/100

  • DeepSeek – 93/100

Anthropic and OpenAI excelled in optimization, particularly in indexing and following Laravel best practices. DeepSeek had a correct implementation but fell short in indexing efficiency.
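
The shape of the query behind this task can be sketched as follows. The example uses SQLite through Python's standard library as a stand-in for MySQL and Laravel, so it shows the SQL idea rather than any model's Laravel code. The `orders` table, its columns, the index name, and the sample rows are all assumptions.

```python
import sqlite3


def top_customers(conn, since, limit=5):
    """Return (customer_id, order_count) for the customers with the most orders since a date."""
    return conn.execute(
        """
        SELECT customer_id, COUNT(*) AS order_count
        FROM orders
        WHERE created_at >= ?
        GROUP BY customer_id
        ORDER BY order_count DESC, customer_id ASC
        LIMIT ?
        """,
        (since, limit),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL, created_at TEXT NOT NULL)"
)
# Composite index: the date filter and the grouped column can both be read
# from the index without touching the full rows.
conn.execute(
    "CREATE INDEX idx_orders_created_customer ON orders (created_at, customer_id)"
)

# Hypothetical sample data; dates are ISO strings so they compare correctly as text.
rows = [
    (1, "2024-05-01"),
    (1, "2024-06-01"),
    (2, "2024-06-15"),
    (3, "2023-01-01"),
    (1, "2023-02-01"),
]
conn.executemany("INSERT INTO orders (customer_id, created_at) VALUES (?, ?)", rows)

result = top_customers(conn, "2024-01-01")
print(result)  # [(1, 2), (2, 1)]
```

The key design choices are filtering by date before aggregating, letting the database do the counting and sorting, and returning only five rows instead of loading records into application memory. On a table with millions of rows, whether the index is actually used should be confirmed with the database's query plan output.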

Unique Features of Each Model

Each of the analyzed models has its own strengths, which may influence which one to choose for a given application. In this section, we examine the distinguishing characteristics of each LLM that may determine its usefulness in software development projects.

↗ Gemini

Developed by Google DeepMind, Gemini is known for its multimodal capabilities, meaning it can process both text and images. 

The Gemini 2.0 Flash version is an update that enhances these capabilities. It stands out with its large context window of up to 1 million tokens, which allows for a better understanding of long documents.

↗ OpenAI (GPT-4o)

The latest version of OpenAI’s GPT model, GPT-4o, is characterized by advanced code generation and natural language understanding.

It achieves high scores in coding benchmarks, such as HumanEval, making it an excellent tool for developers.

↗ Anthropic (Claude 3.5 Sonnet)

Claude 3.5 Sonnet is distinguished by its ability to maintain context over very long interactions, thanks to its 200,000-token context window.

It is also praised for its reasoning and text analysis capabilities.

↗ DeepSeek

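DeepSeek is a Chinese AI model that has gained popularity due to its cost efficiency and open access.

DeepSeek utilizes the Mixture of Experts (MoE) architecture, in which a routing layer activates only a subset of the model's expert sub-networks for each input. This allows for dynamic resource allocation and improved efficiency.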
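
As a rough illustration of the general MoE idea, the toy sketch below routes an input to the two highest-scoring experts and blends their outputs. It is not DeepSeek's implementation: the experts are plain functions, the gate scores are passed in as fixed numbers rather than computed by a learned layer, and the names `moe_forward` and `top_k` are assumptions.

```python
import math


def softmax(values):
    """Turn raw scores into weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_logits, experts, top_k=2):
    """Run only the top_k experts chosen by the gate and blend their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([gate_logits[i] for i in chosen])
    # Experts outside `chosen` are never called, which is where the savings come from.
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen


# Hypothetical experts and gate scores for a single input.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: -x, lambda x: x * x]
gate_logits = [0.0, 2.0, -1.0, 2.0]

output, chosen = moe_forward(3, gate_logits, experts)
print(output, chosen)  # 7.5 [1, 3]
```

Only two of the four experts run for this input, so the compute per input stays roughly constant even as more experts are added.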

Summary

The benchmark demonstrated that LLMs perform well in programming tasks but differ in implementation details. Anthropic and OpenAI performed best, especially in code readability and optimization. Gemini proved to be a solid choice across all categories, while DeepSeek showed some weaknesses in security and indexing.
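
The overall ranking can be reproduced from the category scores reported above. The short sketch below assumes the final score is an unweighted mean of the three categories; the dictionary keys are labels chosen for illustration.

```python
# Category scores as reported in the results sections of this article.
SCORES = {
    "Gemini": {"algorithm": 94, "api": 96, "database": 88},
    "OpenAI": {"algorithm": 96, "api": 95, "database": 94},
    "Anthropic": {"algorithm": 97, "api": 98, "database": 97},
    "DeepSeek": {"algorithm": 95, "api": 87, "database": 93},
}

# Assumption: every category carries equal weight.
averages = {
    model: round(sum(scores.values()) / len(scores), 2)
    for model, scores in SCORES.items()
}
ranking = sorted(averages, key=averages.get, reverse=True)

for model in ranking:
    print(f"{model}: {averages[model]}")
# Anthropic: 97.33
# OpenAI: 95.0
# Gemini: 92.67
# DeepSeek: 91.67
```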

Choosing the right model depends on priorities:

  • If code optimization and quality are most important, OpenAI or Anthropic are excellent choices.

  • If the speed of generating correct solutions is the key factor, Gemini is a comparably good choice.

  • Meanwhile, DeepSeek, although slightly weaker in some aspects, may be attractive for those seeking an open and cost-effective AI model.

This analysis shows that AI is a powerful tool for assisting developers, but its effectiveness still depends on the specific use case and conscious tool selection.


Back-end Developer
Jakub Wachol
