HKU Business School today released its Large Language Model (LLM) reasoning capability assessment report, a comprehensive benchmark of the reasoning performance of 36 leading LLMs on Chinese-language inputs.
The report reveals that GPT-o3 topped the basic logical inference assessment, while Gemini 2.5 Flash took the lead in contextual reasoning. For overall reasoning capability, Doubao 1.5 Pro (Thinking) ranked first, followed closely by GPT-5. Several Chinese LLMs, including Doubao 1.5 Pro, Qwen 3 (Thinking), and DeepSeek-R1, also ranked highly, demonstrating the strong reasoning capabilities of Chinese LLMs on Chinese inputs.
From OpenAI's pioneering o1 reasoning model to DeepSeek-R1's focus on problem-solving, the LLM market continues to evolve, and models are increasingly judged on the power and accuracy of their reasoning. In light of this, the Artificial Intelligence Evaluation Lab (AIEL) (https://www.hkubs.hku.hk/aimodelrankings_en) at HKU Business School, led by Professor Jack Jiang, developed a comprehensive evaluation system covering both basic logical inference and contextual reasoning. Using test sets of varying difficulty, the lab benchmarked the models on Chinese inputs.
The test subjects comprised 36 mainstream LLMs from China and the United States: 14 reasoning models, 20 general-purpose models, and two unified systems. The results showed that for basic logical inference, the gap between reasoning models and general-purpose models was relatively small; for contextual reasoning, however, the advantages of reasoning models became more visible. Moreover, comparisons of models from the same company revealed that reasoning models generally perform better in contextual reasoning, confirming that a model architecture's overall competitiveness is best revealed on complex tasks.
Professor Jiang said, "The reasoning capabilities of LLMs are inextricably linked to their cultural and linguistic environments. As the reasoning capabilities of large models gain increasing attention, we hope to use this evaluation system to identify the 'strongest brains' when it comes to the Chinese context. This will then drive the continuous improvement of reasoning capabilities across various models, further optimising efficiency and costs, and enabling them to realise their value in a wider range of application scenarios."
Evaluation Scope and Methodology
In the study, 90% of the questions were original or meticulously adapted, while 10% were selected from Mainland China's high school entrance exams, college entrance exams, and well-known datasets. This approach aimed to authentically test the models' independent reasoning capabilities.
As for question complexity, 60% of the questions were simple and 40% were complex, and the assessment progressed from easier to harder questions to characterise each model's reasoning capabilities accurately.
Responses were scored on accuracy (correctness or reasonableness), logical coherence, and conciseness.
Basic Logical Inference Capability
In the Basic Logical Inference capability assessment, GPT-o3 took first place, followed closely by Doubao 1.5 Pro. Some models, such as Llama 3.3 70B and 360 Zhinao 2-o1, exhibited significant weaknesses in basic logic.
| Ranking | Model Name | Basic Logical Inference Weighted Score |
| --- | --- | --- |
| 1 | GPT-o3 | 97 |
| 2 | Doubao 1.5 Pro | 96 |
| 3 | Doubao 1.5 Pro (Thinking) | 95 |
| 4 | GPT-5 | 94 |
| 5 | DeepSeek-R1 | 92 |
| 6 | Qwen 3 (Thinking) | 90 |
| 7 | Gemini 2.5 Pro | 88 |
| 7 | GPT-o4 mini | 88 |
| 7 | Hunyuan-T1 | 88 |
| 7 | Ernie X1-Turbo | 88 |
| 11 | GPT-4.1 | 87 |
| 11 | GPT-4o | 87 |
| 11 | Qwen 3 | 87 |
| 14 | DeepSeek-V3 | 86 |
| 14 | Grok 3 (Thinking) | 86 |
| 14 | SenseChat V6 (Thinking) | 86 |
| 17 | Claude 4 Opus | 85 |
| 17 | Claude 4 Opus (Thinking) | 85 |
| 19 | Gemini 2.5 Flash | 84 |
| 20 | SenseChat V6 Pro | 83 |
| 21 | Hunyuan-TurboS | 81 |
| 22 | Baichuan4-Turbo | 80 |
| 22 | Grok 3 | 80 |
| 22 | Grok 4 | 80 |
| 22 | Yi-Lightning | 80 |
| 26 | MiniMax-01 | 79 |
| 27 | Spark 4.0 Ultra | 77 |
| 27 | Step R1-V-Mini | 77 |
| 29 | GLM-4-plus | 76 |
| 29 | GLM-Z1-Air | 76 |
| 29 | Kimi | 76 |
| 32 | Ernie 4.5-Turbo | 74 |
| 33 | Step 2 | 73 |
| 34 | Kimi-k1.5 | 72 |
| 35 | Llama 3.3 70B | 64 |
| 36 | 360 Zhinao 2-o1 | 59 |
Table 1: Ranking for Basic Logical Inference Capability
Contextual Reasoning Capability
In the Contextual Reasoning Capability ranking, Gemini 2.5 Flash took first place, excelling in common-sense reasoning and discipline-based reasoning. Doubao 1.5 Pro (Thinking) excelled in common-sense reasoning, while Gemini 2.5 Pro demonstrated strengths in discipline-based reasoning and decision-making under uncertainty; the two tied for second place. Grok 3 (Thinking), as well as models from the GPT, Ernie, DeepSeek, Hunyuan, and Qwen families, also performed well.
| Ranking | Model Name | Common-sense Reasoning | Discipline-based Reasoning | Decision-Making Under Uncertainty | Moral & Ethical Reasoning | Final Weighted Score |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Gemini 2.5 Flash | 98 | 93 | 89 | 87 | 92 |
| 2 | Doubao 1.5 Pro (Thinking) | 97 | 92 | 88 | 87 | 91 |
| 2 | Gemini 2.5 Pro | 93 | 94 | 90 | 87 | 91 |
| 4 | Grok 3 (Thinking) | 96 | 88 | 89 | 86 | 90 |
| 5 | GPT-5 | 88 | 98 | 88 | 83 | 89 |
| 5 | Hunyuan-T1 | 97 | 95 | 84 | 81 | 89 |
| 5 | Qwen 3 (Thinking) | 96 | 89 | 86 | 85 | 89 |
| 5 | Ernie X1-Turbo | 98 | 85 | 86 | 86 | 89 |
| 9 | DeepSeek-R1 | 94 | 93 | 78 | 82 | 87 |
| 9 | Qwen 3 | 97 | 79 | 87 | 86 | 87 |
| 9 | Ernie 4.5-Turbo | 96 | 76 | 87 | 87 | 87 |
| 12 | Hunyuan-TurboS | 96 | 79 | 83 | 84 | 86 |
| 13 | Doubao 1.5 Pro | 97 | 81 | 86 | 74 | 85 |
| 13 | GPT-4.1 | 97 | 70 | 87 | 86 | 85 |
| 13 | GPT-o3 | 90 | 95 | 73 | 80 | 85 |
| 13 | Grok 3 | 97 | 69 | 87 | 86 | 85 |
| 13 | Grok 4 | 82 | 87 | 82 | 87 | 85 |
| 18 | DeepSeek-V3 | 95 | 81 | 84 | 77 | 84 |
| 19 | GPT-4o | 98 | 65 | 87 | 78 | 82 |
| 19 | GPT-o4 mini | 91 | 87 | 72 | 76 | 82 |
| 21 | Claude 4 Opus (Thinking) | 96 | 84 | 72 | 71 | 81 |
| 21 | MiniMax-01 | 96 | 69 | 83 | 75 | 81 |
| 21 | 360 Zhinao 2-o1 | 93 | 76 | 81 | 72 | 81 |
| 24 | Claude 4 Opus | 95 | 85 | 70 | 70 | 80 |
| 24 | GLM-4-plus | 93 | 71 | 83 | 73 | 80 |
| 24 | Step 2 | 97 | 63 | 82 | 78 | 80 |
| 27 | Yi-Lightning | 97 | 59 | 82 | 79 | 79 |
| 27 | Kimi | 94 | 61 | 79 | 81 | 79 |
| 29 | Spark 4.0 Ultra | 91 | 71 | 75 | 76 | 78 |
| 30 | SenseChat V6 Pro | 86 | 58 | 84 | 78 | 77 |
| 31 | GLM-Z1-Air | 90 | 76 | 73 | 64 | 76 |
| 32 | Llama 3.3 70B | 82 | 52 | 83 | 81 | 75 |
| 33 | SenseChat V6 (Thinking) | 96 | 63 | 68 | 70 | 74 |
| 34 | Baichuan4-Turbo | 91 | 48 | 77 | 69 | 71 |
| 35 | Step R1-V-Mini | 96 | 80 | 37 | 51 | 66 |
| 36 | Kimi-k1.5 | 84 | 79 | 42 | 58 | 66 |
Table 2: Ranking for Contextual Reasoning Capability
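The report does not disclose the weights behind Table 2's final weighted score, but the published figures are consistent with an equal-weight average of the four dimension scores, rounded half up (e.g., Ernie 4.5-Turbo: (96 + 76 + 87 + 87) / 4 = 86.5, published as 87). Below is a minimal Python sketch of that reconstruction; the equal weights and the round-half-up rule are inferences from the published numbers, not documented methodology.

```python
import math

# Dimension scores from Table 2, in order: common-sense, discipline-based,
# decision-making under uncertainty, moral & ethical reasoning.
SAMPLE_ROWS = {
    "Gemini 2.5 Flash": (98, 93, 89, 87),  # published final score: 92
    "GPT-5": (88, 98, 88, 83),             # published final score: 89
    "Ernie 4.5-Turbo": (96, 76, 87, 87),   # published final score: 87
}

def final_weighted_score(scores, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted mean of the four dimension scores, rounded half up.

    Equal weights and round-half-up are assumptions inferred from the
    published data, not the report's documented methodology.
    """
    mean = sum(s * w for s, w in zip(scores, weights))
    # math.floor(x + 0.5) rounds .5 upward; Python's built-in round() uses
    # banker's rounding and would turn 86.5 into 86, contradicting Table 2.
    return math.floor(mean + 0.5)

for model, dims in SAMPLE_ROWS.items():
    print(f"{model}: {final_weighted_score(dims)}")
```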
Composite Ranking Results
In terms of composite capabilities, the 36 models showed significant differences. Doubao 1.5 Pro (Thinking) took the top spot, demonstrating superior performance in both basic logical inference and contextual reasoning. GPT-5 came a close second, with GPT-o3 and Doubao 1.5 Pro placing third and fourth, respectively.
| Ranking | Model Name | Score |
| --- | --- | --- |
| 1 | Doubao 1.5 Pro (Thinking) | 93 |
| 2 | GPT-5 | 91.5 |
| 3 | GPT-o3 | 91 |
| 4 | Doubao 1.5 Pro | 90.5 |
| 5 | DeepSeek-R1 | 89.5 |
| 5 | Gemini 2.5 Pro | 89.5 |
| 5 | Qwen 3 (Thinking) | 89.5 |
| 8 | Hunyuan-T1 | 88.5 |
| 8 | Ernie X1-Turbo | 88.5 |
| 10 | Gemini 2.5 Flash | 88 |
| 10 | Grok 3 (Thinking) | 88 |
| 12 | Qwen 3 | 87 |
| 13 | GPT-4.1 | 86 |
| 14 | DeepSeek-V3 | 85 |
| 14 | GPT-o4 mini | 85 |
| 16 | GPT-4o | 84.5 |
| 17 | Hunyuan-TurboS | 83.5 |
| 18 | Claude 4 Opus (Thinking) | 83 |
| 19 | Claude 4 Opus | 82.5 |
| 19 | Grok 3 | 82.5 |
| 19 | Grok 4 | 82.5 |
| 22 | Ernie 4.5-Turbo | 80.5 |
| 23 | MiniMax-01 | 80 |
| 23 | SenseChat V6 Pro | 80 |
| 23 | SenseChat V6 (Thinking) | 80 |
| 26 | Yi-Lightning | 79.5 |
| 27 | GLM-4-plus | 78 |
| 28 | Kimi | 77.5 |
| 28 | Spark 4.0 Ultra | 77.5 |
| 30 | Step 2 | 76.5 |
| 31 | GLM-Z1-Air | 76 |
| 32 | Baichuan4-Turbo | 75.5 |
| 33 | Step R1-V-Mini | 71.5 |
| 34 | 360 Zhinao 2-o1 | 70 |
| 35 | Llama 3.3 70B | 69.5 |
| 36 | Kimi-k1.5 | 69 |
Table 3: Composite Ranking
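The composite scores follow directly from the two earlier tables: every score in Table 3 equals the simple average of the model's Table 1 (basic logical inference) and Table 2 (contextual reasoning) scores. A short Python sketch illustrating the relationship, with scores hard-coded from the tables above:

```python
# Composite score (Table 3) as the simple average of the Table 1 and
# Table 2 scores; the published figures match this for every model.
PUBLISHED = {
    # model: (basic logic, contextual reasoning, composite)
    "Doubao 1.5 Pro (Thinking)": (95, 91, 93.0),
    "GPT-5": (94, 89, 91.5),
    "Gemini 2.5 Flash": (84, 92, 88.0),
    "Kimi-k1.5": (72, 66, 69.0),
}

for model, (basic, contextual, composite) in PUBLISHED.items():
    assert (basic + contextual) / 2 == composite, model
    print(f"{model}: ({basic} + {contextual}) / 2 = {composite}")
```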
Click here to view the complete "Large Language Model Reasoning Capability Evaluation Report."
Reviewing the above rankings, many Chinese LLMs performed exceptionally well and have made rapid progress, demonstrating the unique advantages and strong potential of China's LLM industry in Chinese-language contexts.