Evaluating the Performance of Leading AI Models in Programming Tasks

Jakub Wachol | 18/03/2025

Artificial intelligence is playing an increasingly significant role in the IT industry, assisting developers in daily tasks such as code generation and database query optimization.

To assess the effectiveness of the latest large language models (LLMs), we benchmarked four leading solutions: Gemini, OpenAI, Anthropic, and DeepSeek.

We analyzed their performance in algorithm implementation, API development, and database optimization. How did each of them perform? Check the results below!

Research Methodology

To ensure a reliable and objective evaluation, each model was tested on the same programming tasks, covering various aspects of software engineering: algorithm implementation, API development, and database optimization.

The models received identical prompts and were assessed according to predefined metrics, such as solution correctness, code readability, and adherence to best programming practices.

In each category, scores were assigned on a scale from 0 to 100, and each model's final score is the average of its category scores.

Algorithm Implementation: Shortest Path in a Maze (JavaScript, BFS)

Task: Implement a function in JavaScript that finds the shortest path in a maze using the BFS algorithm.

Evaluation Metrics:

Did the model correctly implement the algorithm?
Does the code work for various test cases?
Is the code readable and optimized?

Results:

Gemini – 94/100
OpenAI – 96/100
Anthropic – 97/100
DeepSeek – 95/100

All models successfully implemented the BFS algorithm. OpenAI and Anthropic stood out with better code readability, while DeepSeek received a slightly lower score for optimization.
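For readers unfamiliar with the task, here is a minimal sketch of BFS shortest-path search on a grid maze. The benchmark asked for JavaScript; this sketch uses Python and is not the output of any tested model. The maze encoding (0 = open cell, 1 = wall), the four-direction movement, and the function name `shortest_path` are assumptions, since the article does not specify the exact prompt.

```python
from collections import deque


def shortest_path(maze, start, goal):
    """Return the shortest path from start to goal as a list of (row, col)
    cells, or None if the goal cannot be reached.

    Assumed encoding (not given in the article): 0 = open cell, 1 = wall,
    with moves allowed up, down, left and right.
    """
    rows, cols = len(maze), len(maze[0])
    if maze[start[0]][start[1]] == 1 or maze[goal[0]][goal[1]] == 1:
        return None

    queue = deque([start])
    parent = {start: None}  # also serves as the visited set

    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]

        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            nxt = (nr, nc)
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] == 0 and nxt not in parent):
                parent[nxt] = cell
                queue.append(nxt)

    return None
```

Because BFS explores cells in order of distance from the start, the first time the goal is dequeued the recorded path is guaranteed to be a shortest one. Marking cells as visited when they are enqueued, rather than when they are dequeued, keeps each cell in the queue at most once.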

Creating an API Server with JWT Authentication (Python - FastAPI)

Task: Develop a REST API server in FastAPI that handles user login and JWT authentication.

Evaluation Metrics:

Does the API function as required?
Is JWT authentication correctly implemented?
Is the code secure and aligned with best practices?

Results:

Gemini – 96/100
OpenAI – 95/100
Anthropic – 98/100
DeepSeek – 87/100

OpenAI and Anthropic demonstrated strong authentication approaches, whereas DeepSeek scored lower due to weaker security implementation.
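To make the "is JWT authentication correctly implemented" criterion concrete, the sketch below shows the core of HS256 token creation and verification using only the Python standard library. It is an illustration, not the code produced by any tested model, and it deliberately leaves out FastAPI itself; in a real service a maintained JWT library would normally be used instead of hand-rolled signing. The function names, the 15-minute default lifetime, and the claim set (`sub`, `iat`, `exp`) are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode(), signing_input.encode("ascii"),
                    hashlib.sha256).digest()


def create_token(subject, secret, ttl_seconds=900, now=None):
    """Build a signed HS256 token for `subject` (hypothetical helper)."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": subject, "iat": now, "exp": now + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    return f"{signing_input}.{_b64url(_sign(signing_input, secret))}"


def verify_token(token, secret, now=None):
    """Return the payload if the token is valid and unexpired, else None."""
    now = int(time.time()) if now is None else now
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        # Pin the algorithm instead of trusting whatever the token claims.
        if header.get("alg") != "HS256":
            return None
        expected = _sign(f"{header_b64}.{payload_b64}", secret)
        # Constant-time comparison avoids leaking signature bytes via timing.
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        payload = json.loads(_b64url_decode(payload_b64))
        if payload.get("exp", 0) <= now:
            return None
        return payload
    except (ValueError, TypeError, AttributeError):
        return None
```

The checks that typically separate a secure implementation from a weak one are visible here: the algorithm is pinned on verification, the signature is compared in constant time, and expiry is enforced. Password hashing and secret management, which a full login endpoint also needs, are outside the scope of this sketch.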

Optimizing Large Datasets in MySQL (PHP - Laravel)

Task: Optimize Laravel queries to handle millions of records and identify the five customers with the most orders in the past six months.

Evaluation Metrics:

Did the model use indexing (INDEX)?
Is the query optimized for performance?
Is the code aligned with Laravel best practices?

Results:

Gemini – 88/100
OpenAI – 94/100
Anthropic – 97/100
DeepSeek – 93/100

Anthropic and OpenAI excelled in optimization, particularly in indexing and following Laravel best practices. DeepSeek had a correct implementation but fell short in indexing efficiency.
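The shape of the query behind this task can be sketched as follows. This is a simplified illustration, not the Laravel code produced by any tested model: it uses Python with an in-memory SQLite database in place of MySQL and Laravel, and the `orders` table, its columns, and the index name are hypothetical. The idea it shows is the one the metrics look for: filter on the date range, aggregate per customer, and back the filter with an index.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL,"
        " created_at TEXT NOT NULL)"  # ISO dates, so string order = date order
    )
    # Composite index: the date range filter can use the leading column,
    # and customer_id is available from the index without a table lookup.
    conn.execute(
        "CREATE INDEX idx_orders_created_customer"
        " ON orders (created_at, customer_id)"
    )
    return conn


def top_customers(conn, since, limit=5):
    """Return (customer_id, order_count) for the customers with the most
    orders placed on or after `since`, highest count first."""
    return conn.execute(
        """
        SELECT customer_id, COUNT(*) AS order_count
        FROM orders
        WHERE created_at >= ?
        GROUP BY customer_id
        ORDER BY order_count DESC, customer_id ASC
        LIMIT ?
        """,
        (since, limit),
    ).fetchall()
```

Passing the cutoff date as a bound parameter, rather than computing it inside the SQL, keeps the comparison on the bare column so the index remains usable. Whether MySQL would pick the same plan on millions of rows depends on the data distribution and should be checked with `EXPLAIN` on the real table.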

Unique Features of Each Model

Each of the analyzed models has its own strengths, which may influence which one is the best fit for a specific application. In this section, we examine the distinguishing characteristics of each LLM that may determine its usefulness in software development projects.

Gemini

Developed by Google DeepMind, Gemini is known for its multimodal capabilities, meaning it can process both text and images.

The Gemini 2.0 Flash version is an update that enhances its capabilities. It stands out with its large context window, reaching up to 1 million tokens, allowing for a better understanding of long documents.

OpenAI (GPT-4o)

The latest version of OpenAI’s GPT model, GPT-4o, is characterized by advanced code generation and natural language understanding.

It achieves high scores in coding benchmarks, such as HumanEval, making it an excellent tool for developers.

Anthropic (Claude 3.5 Sonnet)

Claude 3.5 Sonnet is distinguished by its ability to maintain context over very long interactions, thanks to its 200,000-token context window.

It is also praised for its reasoning and text analysis capabilities.

DeepSeek

DeepSeek is a Chinese AI model that has gained popularity due to its cost efficiency and open access.

DeepSeek uses a Mixture of Experts (MoE) architecture, which routes each input to a subset of specialized sub-networks instead of activating the whole model, allowing for dynamic resource allocation and improved efficiency.

Summary

The benchmark demonstrated that LLMs perform well in programming tasks but differ in implementation details. Anthropic and OpenAI performed best, especially in code readability and optimization. Gemini proved to be a solid choice across all categories, while DeepSeek showed some weaknesses in security and indexing.
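The article does not list the final averaged scores. Assuming an unweighted mean of the three category scores reported above (the methodology says scores were averaged but does not mention weights), they can be recomputed with a short sketch like this one. The function names are illustrative.

```python
from statistics import mean

# Category scores as reported in this article, in the order:
# algorithm implementation, API development, database optimization.
SCORES = {
    "Gemini": [94, 96, 88],
    "OpenAI": [96, 95, 94],
    "Anthropic": [97, 98, 97],
    "DeepSeek": [95, 87, 93],
}


def final_scores(scores):
    """Unweighted mean per model, rounded to one decimal (an assumption:
    the article does not say whether categories were weighted)."""
    return {model: round(mean(values), 1) for model, values in scores.items()}


def ranking(scores):
    """Model names sorted from highest to lowest final score."""
    finals = final_scores(scores)
    return sorted(finals, key=finals.get, reverse=True)


print(final_scores(SCORES))
# {'Gemini': 92.7, 'OpenAI': 95, 'Anthropic': 97.3, 'DeepSeek': 91.7}
print(ranking(SCORES))
# ['Anthropic', 'OpenAI', 'Gemini', 'DeepSeek']
```

Under this assumption the spread between the highest and lowest average is under six points, so the ranking is sensitive to any single category score.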

Choosing the right model depends on priorities:

If code optimization and quality are most important, OpenAI or Anthropic are excellent choices.
If speed of generating correct solutions is the key factor, Gemini is also a good choice.
Meanwhile, DeepSeek, although slightly weaker in some aspects, may be attractive for those seeking an open and cost-effective AI model.

This analysis shows that AI is a powerful tool for assisting developers, but its effectiveness still depends on the specific use case and conscious tool selection.