AI Intelligence and Benchmarking Cost (Feb 2026)
As per the Artificial Analysis Intelligence Index v4.0 (February 2026), the scoring ceiling is set by Claude Opus 4.6 (max) at 53.
Adjusted Score Formula
The “Adjusted Score” doubles the penalty for any gap from the ceiling:
Adjusted Score = Intel Score − (53 − Intel Score) = 2 × Intel Score − 53
Every point below the top score therefore costs two points of adjusted score, so performance gaps weigh more heavily than on the raw scale.
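As a minimal sketch, the Adjusted Score column can be reproduced in a few lines of Python (the ceiling of 53 is Claude Opus 4.6's score from the table below):

```python
CEILING = 53  # scoring ceiling set by Claude Opus 4.6 (max)

def adjusted_score(intel_score: float, ceiling: float = CEILING) -> float:
    """Every point below the ceiling costs two points of adjusted score."""
    return intel_score - (ceiling - intel_score)

# Reproduces the Adjusted Score column of the table below
for intel in (53, 51, 50, 48, 42, 41):
    print(intel, adjusted_score(intel))  # -> 53, 49, 47, 43, 31, 29
```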
Model Comparison Table
| Lab | Model | Intel Score | Adjusted Score | Benchmark Cost | Intel Ratio (Score/Cost) | Adj. Ratio (Adj/Cost) |
|---|---|---|---|---|---|---|
| Anthropic | Claude Opus 4.6 (max) | 53 | 53 | $2,486.45 | 0.021 | 0.021 |
| OpenAI | GPT-5.2 (xhigh) | 51 | 49 | $2,304.00* | 0.022 | 0.021 |
| Zhipu AI | GLM-5 (Reasoning) | 50 | 47 | $384.00* | 0.130 | 0.122 |
| Google | Gemini 3 Pro | 48 | 43 | $1,179.00* | 0.041 | 0.036 |
| MiniMax | MiniMax-M2.5 | 42 | 31 | $124.58 | 0.337 | 0.249 |
| DeepSeek | DeepSeek V3.2 (Reasoning) | 42 | 31 | $70.64 | 0.595 | 0.439 |
| xAI | Grok 4 (Reasoning) | 41 | 29 | $1,568.34 | 0.026 | 0.018 |
*Benchmark costs for proprietary models are based on Artificial Analysis evaluation token counts (typically 12M–88M depending on verbosity) multiplied by current API rates.
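A minimal sketch of that estimate; the token count and per-million rate below are illustrative placeholders, not actual API pricing:

```python
def benchmark_cost_usd(eval_tokens_millions: float, usd_per_million_tokens: float) -> float:
    """Estimated cost of one benchmark run: evaluation tokens (in millions)
    multiplied by the API rate per million tokens."""
    return eval_tokens_millions * usd_per_million_tokens

# Illustrative only: a verbose reasoning model emitting 88M tokens
# at a hypothetical $15 per million tokens.
print(benchmark_cost_usd(88, 15.0))  # 1320.0
```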
Key Insights
- High-token reasoning models: Grok 4 and Claude Opus 4.6 consume a large number of tokens during reasoning, up to 88M. This results in low intelligence-to-cost ratios despite high scores.
- DeepSeek V3.2 is the most efficient: its adjusted intelligence ratio is roughly 20 times better than the proprietary frontier.
- Cost efficiency comparison: MiniMax-M2.5 and DeepSeek V3.2 share a score of 42, but DeepSeek is almost twice as cost-effective thanks to lower API pricing and higher token efficiency (quick check below).
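Both ratio claims follow directly from the Adj. Ratio column; a quick check with the values copied from the table:

```python
# Quick check of the two ratio claims, using the Adj. Ratio column above
adj_ratio = {"DeepSeek V3.2": 0.439, "MiniMax-M2.5": 0.249, "Claude Opus 4.6": 0.021}

print(adj_ratio["DeepSeek V3.2"] / adj_ratio["Claude Opus 4.6"])  # ~20.9x the proprietary frontier
print(adj_ratio["DeepSeek V3.2"] / adj_ratio["MiniMax-M2.5"])     # ~1.76x MiniMax at the same score of 42
```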
Visual Summary
Intel Score vs Cost Efficiency (Adjusted Ratio)
─────────────────────────────────────────────────
DeepSeek V3.2 ████████████████████████████ 0.439
MiniMax-M2.5 ███████████████ 0.249
GLM-5 ███████ 0.122
Gemini 3 Pro ██ 0.036
Claude Opus 4.6 █ 0.021
GPT-5.2 █ 0.021
Grok 4 █ 0.018
Source: Artificial Analysis Intelligence Index v4.0, February 2026
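The bars simply map the Adj. Ratio column onto a fixed-width scale; a small Python sketch that approximately regenerates the chart (bar widths may differ by a character due to rounding):

```python
# Maps the Adj. Ratio column onto fixed-width bars
ratios = [
    ("DeepSeek V3.2", 0.439), ("MiniMax-M2.5", 0.249), ("GLM-5", 0.122),
    ("Gemini 3 Pro", 0.036), ("Claude Opus 4.6", 0.021),
    ("GPT-5.2", 0.021), ("Grok 4", 0.018),
]
scale = 28 / max(r for _, r in ratios)  # widest bar = 28 characters
for name, r in ratios:
    print(f"{name:<16}{'█' * max(1, round(r * scale))} {r:.3f}")
```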
Google AI Mode did the analysis; GLM-5 formatted it and added the cute graph.
This combines the intelligence score with the cost of running the intelligence benchmark, from https://artificialanalysis.ai/?endpoints=openai_gpt-5-2-codex%2Cazure_kimi-k2-thinking%2Camazon-bedrock_qwen3-coder-480b-a35b-instruct%2Camazon-bedrock_qwen3-coder-30b-a3b-instruct%2Ctogetherai_minimax-m2-5_fp4%2Ctogetherai_glm-5_fp4%2Ctogetherai_qwen3-next-80b-a3b-reasoning%2Cgoogle_gemini-3-pro_ai-studio%2Cgoogle_glm-4-7%2Cmoonshot-ai_kimi-k2-thinking_turbo%2Cnovita_glm-5_fp8
Look at the intelligence-vs-cost graph there for further insight. You can add much smaller models for comparison with LLMs you might run locally.
The adjusted intelligence/cost metric is a useful heuristic for “how much extra would you pay for the top score”. Choosing the non-open models means accepting a much steeper premium than the 2x penalty applied to the gap from the highest score.
Quantized versions don’t seem to score lower. The site provides good base data for building your own combined metric out of score deficit, model size, and tokens per second, weighed against the tokens/cost needed to reach a benchmark score.
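As a rough illustration of such a combined model: the weighting function below is hypothetical, and the weights, throughput figure, and model-size figure are arbitrary placeholders rather than values from the site.

```python
# Hypothetical combined metric, sketching the idea above: penalize the score
# deficit from the ceiling, reward throughput, penalize benchmark cost and
# model size. Weights are arbitrary placeholders to tune for your own use case.
def combined_score(intel_score: float, benchmark_cost_usd: float,
                   tokens_per_second: float, model_size_b: float,
                   ceiling: float = 53.0, w_deficit: float = 2.0,
                   w_cost: float = 0.01, w_tps: float = 0.05,
                   w_size: float = 0.02) -> float:
    return (intel_score
            - w_deficit * (ceiling - intel_score)   # score deficit penalty
            - w_cost * benchmark_cost_usd           # benchmark cost penalty
            + w_tps * tokens_per_second             # throughput bonus
            - w_size * model_size_b)                # local-footprint penalty

# Example with DeepSeek V3.2's table values; the tps and size figures are made up.
print(combined_score(42, 70.64, tokens_per_second=30, model_size_b=37))
```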
I was originally researching how the Grok 4.2 approach would inflate cost versus performance, but it is not yet benchmarked.

