HKU Benchmarks AI Reasoning on Chinese Tasks

HKU Business School today released its Large Language Model (LLM) reasoning capability assessment report, a comprehensive benchmark of how 36 leading LLMs reason when prompted in Chinese language and characters.

The report reveals that GPT-o3 topped the basic logical inference assessment, while Gemini 2.5 Flash took the lead in the contextual reasoning assessment. For overall reasoning capability, Doubao 1.5 Pro (Thinking) ranked first, followed closely by GPT-5. Several Chinese LLMs, including Doubao 1.5 Pro, Qwen 3 (Thinking), and DeepSeek-R1, also ranked highly, demonstrating the strength of Chinese LLMs on Chinese-language inputs.

From OpenAI's o1, the pioneering reasoning model, to DeepSeek-R1's focus on problem-solving, the LLM market continues to evolve, and models are increasingly judged on the power and accuracy of their reasoning. In light of this, the Artificial Intelligence Evaluation Lab (AIEL) (https://www.hkubs.hku.hk/aimodelrankings_en) at HKU Business School, led by Professor Jack Jiang, developed a comprehensive evaluation system covering both basic logical inference and contextual reasoning, and used test sets of varying difficulty to benchmark LLMs on Chinese inputs.

Test subjects comprised 36 mainstream LLMs from China and the United States: 14 reasoning models, 20 general-purpose models, and two unified systems. The results showed that for basic logical inference, the gap between reasoning models and general-purpose models was relatively small, whereas for contextual reasoning the advantage of reasoning models became more visible. Comparisons of models from the same company likewise showed that reasoning models generally perform better on contextual reasoning, suggesting that complex tasks are what best reveal the competitiveness of a model's architecture.

Professor Jiang said, "The reasoning capabilities of LLMs are inextricably linked to their cultural and linguistic environments. As the reasoning capabilities of large models gain increasing attention, we hope to use this evaluation system to identify the 'strongest brains' when it comes to the Chinese context. This will then drive the continuous improvement of reasoning capabilities across various models, further optimising efficiency and costs, and enabling them to realise their value in a wider range of application scenarios."

Evaluation Scope and Methodology

In the study, 90% of the questions were original or meticulously adapted, while 10% were selected from Mainland China's high school entrance exams, college entrance exams, and well-known datasets. This approach aimed to authentically test the models' independent reasoning capabilities.

In terms of question complexity, 60% of the questions were simple and 40% complex, with a progressively harder assessment process used to characterise each model's reasoning capabilities accurately.

Each model's responses were scored on accuracy (correctness or reasonableness), logical coherence, and conciseness.
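
To make the methodology concrete, here is a minimal sketch in Python of how the published question mix and scoring rubric could be encoded. The 90/10 source split, the 60/40 difficulty split, and the three scoring dimensions come from the report; the `weights` in `score_response` are hypothetical, since the report does not disclose how the dimensions are aggregated.

```python
from dataclasses import dataclass

# Question composition reported by the study.
SOURCE_MIX = {"original_or_adapted": 0.90, "exams_and_known_datasets": 0.10}
DIFFICULTY_MIX = {"simple": 0.60, "complex": 0.40}

@dataclass
class RubricScores:
    """The three scoring dimensions named in the report."""
    accuracy: float           # correctness or reasonableness
    logical_coherence: float
    conciseness: float

def score_response(r: RubricScores, weights=(0.5, 0.3, 0.2)) -> float:
    # Hypothetical weighted aggregation; the report does not publish
    # its actual weights or formula.
    w_acc, w_coh, w_con = weights
    return w_acc * r.accuracy + w_coh * r.logical_coherence + w_con * r.conciseness

print(score_response(RubricScores(accuracy=90, logical_coherence=85, conciseness=80)))
# -> 86.5 under the assumed weights
```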

Basic Logical Inference Capability

In the Basic Logical Inference capability assessment, GPT-o3 took first place, narrowly ahead of Doubao 1.5 Pro and Doubao 1.5 Pro (Thinking). Some models, such as Llama 3.3 70B and 360 Zhinao 2-o1, exhibited significant weaknesses in basic logic.

| Ranking | Model Name | Basic Logical Inference Weighted Score |
| --- | --- | --- |
| 1 | GPT-o3 | 97 |
| 2 | Doubao 1.5 Pro | 96 |
| 3 | Doubao 1.5 Pro (Thinking) | 95 |
| 4 | GPT-5 | 94 |
| 5 | DeepSeek-R1 | 92 |
| 6 | Qwen 3 (Thinking) | 90 |
| 7 | Gemini 2.5 Pro | 88 |
| 7 | GPT-o4 mini | 88 |
| 7 | Hunyuan-T1 | 88 |
| 7 | Ernie X1-Turbo | 88 |
| 11 | GPT-4.1 | 87 |
| 11 | GPT-4o | 87 |
| 11 | Qwen 3 | 87 |
| 14 | DeepSeek-V3 | 86 |
| 14 | Grok 3 (Thinking) | 86 |
| 14 | SenseChat V6 (Thinking) | 86 |
| 17 | Claude 4 Opus | 85 |
| 17 | Claude 4 Opus (Thinking) | 85 |
| 19 | Gemini 2.5 Flash | 84 |
| 20 | SenseChat V6 Pro | 83 |
| 21 | Hunyuan-TurboS | 81 |
| 22 | Baichuan4-Turbo | 80 |
| 22 | Grok 3 | 80 |
| 22 | Grok 4 | 80 |
| 22 | Yi-Lightning | 80 |
| 26 | MiniMax-01 | 79 |
| 27 | Spark 4.0 Ultra | 77 |
| 27 | Step R1-V-Mini | 77 |
| 29 | GLM-4-plus | 76 |
| 29 | GLM-Z1-Air | 76 |
| 29 | Kimi | 76 |
| 32 | Ernie 4.5-Turbo | 74 |
| 33 | Step 2 | 73 |
| 34 | Kimi-k1.5 | 72 |
| 35 | Llama 3.3 70B | 64 |
| 36 | 360 Zhinao 2-o1 | 59 |

Table 1: Ranking for Basic Logical Inference Capability

Contextual Reasoning Capability

In the Contextual Reasoning Capability ranking, Gemini 2.5 Flash took first place, excelling in common-sense reasoning and discipline-based reasoning. Doubao 1.5 Pro (Thinking) excelled in common-sense reasoning, while Gemini 2.5 Pro demonstrated strengths in discipline-based reasoning and decision-making under uncertainty; the two tied for second place. Grok 3 (Thinking) and models from the GPT, Ernie, DeepSeek, Hunyuan, and Qwen families also performed well.

| Ranking | Model Name | Common-sense Reasoning | Discipline-based Reasoning | Decision-Making Under Uncertainty | Moral & Ethical Reasoning | Final Weighted Score |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Gemini 2.5 Flash | 98 | 93 | 89 | 87 | 92 |
| 2 | Doubao 1.5 Pro (Thinking) | 97 | 92 | 88 | 87 | 91 |
| 2 | Gemini 2.5 Pro | 93 | 94 | 90 | 87 | 91 |
| 4 | Grok 3 (Thinking) | 96 | 88 | 89 | 86 | 90 |
| 5 | GPT-5 | 88 | 98 | 88 | 83 | 89 |
| 5 | Hunyuan-T1 | 97 | 95 | 84 | 81 | 89 |
| 5 | Qwen 3 (Thinking) | 96 | 89 | 86 | 85 | 89 |
| 5 | Ernie X1-Turbo | 98 | 85 | 86 | 86 | 89 |
| 9 | DeepSeek-R1 | 94 | 93 | 78 | 82 | 87 |
| 9 | Qwen 3 | 97 | 79 | 87 | 86 | 87 |
| 9 | Ernie 4.5-Turbo | 96 | 76 | 87 | 87 | 87 |
| 12 | Hunyuan-TurboS | 96 | 79 | 83 | 84 | 86 |
| 13 | Doubao 1.5 Pro | 97 | 81 | 86 | 74 | 85 |
| 13 | GPT-4.1 | 97 | 70 | 87 | 86 | 85 |
| 13 | GPT-o3 | 90 | 95 | 73 | 80 | 85 |
| 13 | Grok 3 | 97 | 69 | 87 | 86 | 85 |
| 13 | Grok 4 | 82 | 87 | 82 | 87 | 85 |
| 18 | DeepSeek-V3 | 95 | 81 | 84 | 77 | 84 |
| 19 | GPT-4o | 98 | 65 | 87 | 78 | 82 |
| 19 | GPT-o4 mini | 91 | 87 | 72 | 76 | 82 |
| 21 | Claude 4 Opus (Thinking) | 96 | 84 | 72 | 71 | 81 |
| 21 | MiniMax-01 | 96 | 69 | 83 | 75 | 81 |
| 21 | 360 Zhinao 2-o1 | 93 | 76 | 81 | 72 | 81 |
| 24 | Claude 4 Opus | 95 | 85 | 70 | 70 | 80 |
| 24 | GLM-4-plus | 93 | 71 | 83 | 73 | 80 |
| 24 | Step 2 | 97 | 63 | 82 | 78 | 80 |
| 27 | Yi-Lightning | 97 | 59 | 82 | 79 | 79 |
| 27 | Kimi | 94 | 61 | 79 | 81 | 79 |
| 29 | Spark 4.0 Ultra | 91 | 71 | 75 | 76 | 78 |
| 30 | SenseChat V6 Pro | 86 | 58 | 84 | 78 | 77 |
| 31 | GLM-Z1-Air | 90 | 76 | 73 | 64 | 76 |
| 32 | Llama 3.3 70B | 82 | 52 | 83 | 81 | 75 |
| 33 | SenseChat V6 (Thinking) | 96 | 63 | 68 | 70 | 74 |
| 34 | Baichuan4-Turbo | 91 | 48 | 77 | 69 | 71 |
| 35 | Step R1-V-Mini | 96 | 80 | 37 | 51 | 66 |
| 36 | Kimi-k1.5 | 84 | 79 | 42 | 58 | 66 |

Table 2: Ranking for Contextual Reasoning Capability
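
One pattern worth noting: every "Final Weighted Score" in Table 2 is consistent with a simple unweighted mean of the four dimension scores, rounded half up to the nearest integer. This is an observation from the published numbers, not a formula the report states. A quick spot-check in Python:

```python
# Spot-check a few Table 2 rows: published final score vs. the rounded
# mean of the four dimension scores (rounding halves up).
rows = {
    "Gemini 2.5 Flash": ([98, 93, 89, 87], 92),
    "Doubao 1.5 Pro":   ([97, 81, 86, 74], 85),  # mean 84.5 rounds up to 85
    "DeepSeek-R1":      ([94, 93, 78, 82], 87),
    "Kimi-k1.5":        ([84, 79, 42, 58], 66),
}
for model, (dims, published) in rows.items():
    mean = sum(dims) / len(dims)
    assert int(mean + 0.5) == published, model
    print(f"{model}: mean {mean:.2f} -> {published}")
```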

Composite Ranking Results

In terms of composite capabilities, the 36 models showed significant differences. Doubao 1.5 Pro (Thinking) took the top spot, demonstrating superior performance in both basic logical inference and contextual reasoning. GPT-5 was a close second, with GPT-o3 and Doubao 1.5 Pro placing third and fourth, respectively.

| Ranking | Model Name | Score |
| --- | --- | --- |
| 1 | Doubao 1.5 Pro (Thinking) | 93 |
| 2 | GPT-5 | 91.5 |
| 3 | GPT-o3 | 91 |
| 4 | Doubao 1.5 Pro | 90.5 |
| 5 | DeepSeek-R1 | 89.5 |
| 5 | Gemini 2.5 Pro | 89.5 |
| 5 | Qwen 3 (Thinking) | 89.5 |
| 8 | Hunyuan-T1 | 88.5 |
| 8 | Ernie X1-Turbo | 88.5 |
| 10 | Gemini 2.5 Flash | 88 |
| 10 | Grok 3 (Thinking) | 88 |
| 12 | Qwen 3 | 87 |
| 13 | GPT-4.1 | 86 |
| 14 | DeepSeek-V3 | 85 |
| 14 | GPT-o4 mini | 85 |
| 16 | GPT-4o | 84.5 |
| 17 | Hunyuan-TurboS | 83.5 |
| 18 | Claude 4 Opus (Thinking) | 83 |
| 19 | Claude 4 Opus | 82.5 |
| 19 | Grok 3 | 82.5 |
| 19 | Grok 4 | 82.5 |
| 22 | Ernie 4.5-Turbo | 80.5 |
| 23 | MiniMax-01 | 80 |
| 23 | SenseChat V6 Pro | 80 |
| 23 | SenseChat V6 (Thinking) | 80 |
| 26 | Yi-Lightning | 79.5 |
| 27 | GLM-4-plus | 78 |
| 28 | Kimi | 77.5 |
| 28 | Spark 4.0 Ultra | 77.5 |
| 30 | Step 2 | 76.5 |
| 31 | GLM-Z1-Air | 76 |
| 32 | Baichuan4-Turbo | 75.5 |
| 33 | Step R1-V-Mini | 71.5 |
| 34 | 360 Zhinao 2-o1 | 70 |
| 35 | Llama 3.3 70B | 69.5 |
| 36 | Kimi-k1.5 | 69 |

Table 3: Composite Ranking
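
Comparing the three tables shows where the composite scores come from: each Table 3 score equals the average of the model's Table 1 weighted score and its Table 2 final weighted score (again an observation from the published numbers; the report does not spell out the formula). For example, Doubao 1.5 Pro (Thinking) gets (95 + 91) / 2 = 93, and GPT-5 gets (94 + 89) / 2 = 91.5. A minimal check:

```python
# Composite (Table 3) = mean of basic logical inference score (Table 1)
# and contextual reasoning final weighted score (Table 2).
samples = {
    #                          Table 1, Table 2, Table 3 (published)
    "Doubao 1.5 Pro (Thinking)": (95, 91, 93),
    "GPT-5":                     (94, 89, 91.5),
    "GPT-o3":                    (97, 85, 91),
    "Llama 3.3 70B":             (64, 75, 69.5),
}
for model, (basic, contextual, composite) in samples.items():
    assert (basic + contextual) / 2 == composite
    print(f"{model}: ({basic} + {contextual}) / 2 = {composite}")
```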

The complete "Large Language Model Reasoning Capability Evaluation Report" is available on the AIEL website (https://www.hkubs.hku.hk/aimodelrankings_en).

Across these rankings, many LLMs from China performed exceptionally well and are progressing rapidly, demonstrating the distinct advantages and strong potential of China's LLM industry in Chinese-language tasks.
