HKU Benchmarks AI Reasoning on Chinese Tasks

HKU Business School today released its Large Language Model (LLM) reasoning capability assessment report, a comprehensive benchmark of how 36 leading LLMs reason when prompted in Chinese language and characters.

The report reveals that GPT-o3 topped the basic logical inference assessment, while Gemini 2.5 Flash took the lead in the contextual reasoning assessment. For overall reasoning capability, Doubao 1.5 Pro (Thinking) ranked first, followed closely by GPT-5. Several Chinese LLMs, including Doubao 1.5 Pro, Qwen 3 (Thinking), and DeepSeek-R1, also ranked highly, demonstrating the strength of Chinese LLMs on Chinese-language inputs.

From OpenAI's o1, the pioneering reasoning model, to DeepSeek-R1's focus on problem-solving, the LLM market continues to evolve, and models are increasingly judged on the power and accuracy of their reasoning. In light of this, the Artificial Intelligence Evaluation Lab (AIEL) (https://www.hkubs.hku.hk/aimodelrankings_en) at HKU Business School, led by Professor Jack Jiang, developed a comprehensive evaluation system covering both basic logical inference and contextual reasoning, and used test sets of varying difficulty to benchmark LLMs on Chinese inputs.

Test subjects comprised 36 mainstream LLMs from China and the United States: 14 reasoning models, 20 general-purpose models, and two unified systems. The results showed that for basic logical inference, the gap between reasoning models and general-purpose models was relatively small, whereas for contextual reasoning the advantage of reasoning models became more visible. Comparisons of models from the same company likewise showed that reasoning models generally perform better on contextual reasoning, suggesting that complex tasks are what best reveal the competitiveness of a model's architecture.

Professor Jiang said, "The reasoning capabilities of LLMs are inextricably linked to their cultural and linguistic environments. As the reasoning capabilities of large models gain increasing attention, we hope to use this evaluation system to identify the 'strongest brains' when it comes to the Chinese context. This will then drive the continuous improvement of reasoning capabilities across various models, further optimising efficiency and costs, and enabling them to realise their value in a wider range of application scenarios."

Evaluation Scope and Methodology

In the study, 90% of the questions were original or meticulously adapted, while 10% were selected from Mainland China's high school entrance exams, college entrance exams, and well-known datasets. This approach aimed to authentically test the models' independent reasoning capabilities.

In terms of question complexity, 60% of the questions were simple and 40% complex, with a progressively harder assessment process used to characterise each model's reasoning capabilities accurately.

Each model's responses were scored on accuracy (correctness or reasonableness), logical coherence, and conciseness.
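
To make the methodology concrete, here is a minimal sketch in Python of how the published question mix and scoring rubric could be encoded. The 90/10 source split, the 60/40 difficulty split, and the three scoring dimensions come from the report; the `weights` in `score_response` are hypothetical, since the report does not disclose how the dimensions are aggregated.

```python
from dataclasses import dataclass

# Question composition reported by the study.
SOURCE_MIX = {"original_or_adapted": 0.90, "exams_and_known_datasets": 0.10}
DIFFICULTY_MIX = {"simple": 0.60, "complex": 0.40}

@dataclass
class RubricScores:
    """The three scoring dimensions named in the report."""
    accuracy: float           # correctness or reasonableness
    logical_coherence: float
    conciseness: float

def score_response(r: RubricScores, weights=(0.5, 0.3, 0.2)) -> float:
    # Hypothetical weighted aggregation; the report does not publish
    # its actual weights or formula.
    w_acc, w_coh, w_con = weights
    return w_acc * r.accuracy + w_coh * r.logical_coherence + w_con * r.conciseness

print(score_response(RubricScores(accuracy=90, logical_coherence=85, conciseness=80)))
# -> 86.5 under the assumed weights
```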

Basic Logical Inference Capability

In the Basic Logical Inference capability assessment, GPT-o3 took first place, narrowly ahead of Doubao 1.5 Pro and Doubao 1.5 Pro (Thinking). Some models, such as Llama 3.3 70B and 360 Zhinao 2-o1, exhibited significant weaknesses in basic logic.

| Ranking | Model Name | Basic Logical Inference Weighted Score |
| --- | --- | --- |
| 1 | GPT-o3 | 97 |
| 2 | Doubao 1.5 Pro | 96 |
| 3 | Doubao 1.5 Pro (Thinking) | 95 |
| 4 | GPT-5 | 94 |
| 5 | DeepSeek-R1 | 92 |
| 6 | Qwen 3 (Thinking) | 90 |
| 7 | Gemini 2.5 Pro | 88 |
| 7 | GPT-o4 mini | 88 |
| 7 | Hunyuan-T1 | 88 |
| 7 | Ernie X1-Turbo | 88 |
| 11 | GPT-4.1 | 87 |
| 11 | GPT-4o | 87 |
| 11 | Qwen 3 | 87 |
| 14 | DeepSeek-V3 | 86 |
| 14 | Grok 3 (Thinking) | 86 |
| 14 | SenseChat V6 (Thinking) | 86 |
| 17 | Claude 4 Opus | 85 |
| 17 | Claude 4 Opus (Thinking) | 85 |
| 19 | Gemini 2.5 Flash | 84 |
| 20 | SenseChat V6 Pro | 83 |
| 21 | Hunyuan-TurboS | 81 |
| 22 | Baichuan4-Turbo | 80 |
| 22 | Grok 3 | 80 |
| 22 | Grok 4 | 80 |
| 22 | Yi-Lightning | 80 |
| 26 | MiniMax-01 | 79 |
| 27 | Spark 4.0 Ultra | 77 |
| 27 | Step R1-V-Mini | 77 |
| 29 | GLM-4-plus | 76 |
| 29 | GLM-Z1-Air | 76 |
| 29 | Kimi | 76 |
| 32 | Ernie 4.5-Turbo | 74 |
| 33 | Step 2 | 73 |
| 34 | Kimi-k1.5 | 72 |
| 35 | Llama 3.3 70B | 64 |
| 36 | 360 Zhinao 2-o1 | 59 |

Table 1: Ranking for Basic Logical Inference Capability

Contextual Reasoning Capability

In the Contextual Reasoning Capability ranking, Gemini 2.5 Flash took first place, excelling in common-sense reasoning and discipline-based reasoning. Doubao 1.5 Pro (Thinking) excelled in common-sense reasoning, while Gemini 2.5 Pro demonstrated strengths in discipline-based reasoning and decision-making under uncertainty; the two tied for second place. Grok 3 (Thinking) and models from the GPT, Ernie, DeepSeek, Hunyuan, and Qwen families also performed well.

| Ranking | Model Name | Common-sense Reasoning | Discipline-based Reasoning | Decision-Making Under Uncertainty | Moral & Ethical Reasoning | Final Weighted Score |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Gemini 2.5 Flash | 98 | 93 | 89 | 87 | 92 |
| 2 | Doubao 1.5 Pro (Thinking) | 97 | 92 | 88 | 87 | 91 |
| 2 | Gemini 2.5 Pro | 93 | 94 | 90 | 87 | 91 |
| 4 | Grok 3 (Thinking) | 96 | 88 | 89 | 86 | 90 |
| 5 | GPT-5 | 88 | 98 | 88 | 83 | 89 |
| 5 | Hunyuan-T1 | 97 | 95 | 84 | 81 | 89 |
| 5 | Qwen 3 (Thinking) | 96 | 89 | 86 | 85 | 89 |
| 5 | Ernie X1-Turbo | 98 | 85 | 86 | 86 | 89 |
| 9 | DeepSeek-R1 | 94 | 93 | 78 | 82 | 87 |
| 9 | Qwen 3 | 97 | 79 | 87 | 86 | 87 |
| 9 | Ernie 4.5-Turbo | 96 | 76 | 87 | 87 | 87 |
| 12 | Hunyuan-TurboS | 96 | 79 | 83 | 84 | 86 |
| 13 | Doubao 1.5 Pro | 97 | 81 | 86 | 74 | 85 |
| 13 | GPT-4.1 | 97 | 70 | 87 | 86 | 85 |
| 13 | GPT-o3 | 90 | 95 | 73 | 80 | 85 |
| 13 | Grok 3 | 97 | 69 | 87 | 86 | 85 |
| 13 | Grok 4 | 82 | 87 | 82 | 87 | 85 |
| 18 | DeepSeek-V3 | 95 | 81 | 84 | 77 | 84 |
| 19 | GPT-4o | 98 | 65 | 87 | 78 | 82 |
| 19 | GPT-o4 mini | 91 | 87 | 72 | 76 | 82 |
| 21 | Claude 4 Opus (Thinking) | 96 | 84 | 72 | 71 | 81 |
| 21 | MiniMax-01 | 96 | 69 | 83 | 75 | 81 |
| 21 | 360 Zhinao 2-o1 | 93 | 76 | 81 | 72 | 81 |
| 24 | Claude 4 Opus | 95 | 85 | 70 | 70 | 80 |
| 24 | GLM-4-plus | 93 | 71 | 83 | 73 | 80 |
| 24 | Step 2 | 97 | 63 | 82 | 78 | 80 |
| 27 | Yi-Lightning | 97 | 59 | 82 | 79 | 79 |
| 27 | Kimi | 94 | 61 | 79 | 81 | 79 |
| 29 | Spark 4.0 Ultra | 91 | 71 | 75 | 76 | 78 |
| 30 | SenseChat V6 Pro | 86 | 58 | 84 | 78 | 77 |
| 31 | GLM-Z1-Air | 90 | 76 | 73 | 64 | 76 |
| 32 | Llama 3.3 70B | 82 | 52 | 83 | 81 | 75 |
| 33 | SenseChat V6 (Thinking) | 96 | 63 | 68 | 70 | 74 |
| 34 | Baichuan4-Turbo | 91 | 48 | 77 | 69 | 71 |
| 35 | Step R1-V-Mini | 96 | 80 | 37 | 51 | 66 |
| 36 | Kimi-k1.5 | 84 | 79 | 42 | 58 | 66 |

Table 2: Ranking for Contextual Reasoning Capability
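
One pattern worth noting: every "Final Weighted Score" in Table 2 is consistent with a simple unweighted mean of the four dimension scores, rounded half up to the nearest integer. This is an observation from the published numbers, not a formula the report states. A quick spot-check in Python:

```python
# Spot-check a few Table 2 rows: published final score vs. the rounded
# mean of the four dimension scores (rounding halves up).
rows = {
    "Gemini 2.5 Flash": ([98, 93, 89, 87], 92),
    "Doubao 1.5 Pro":   ([97, 81, 86, 74], 85),  # mean 84.5 rounds up to 85
    "DeepSeek-R1":      ([94, 93, 78, 82], 87),
    "Kimi-k1.5":        ([84, 79, 42, 58], 66),
}
for model, (dims, published) in rows.items():
    mean = sum(dims) / len(dims)
    assert int(mean + 0.5) == published, model
    print(f"{model}: mean {mean:.2f} -> {published}")
```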

Composite Ranking Results

In terms of composite capabilities, the 36 models showed significant differences. Doubao 1.5 Pro (Thinking) took the top spot, demonstrating superior performance in both basic logical inference and contextual reasoning. GPT-5 was a close second, with GPT-o3 and Doubao 1.5 Pro placing third and fourth, respectively.

| Ranking | Model Name | Score |
| --- | --- | --- |
| 1 | Doubao 1.5 Pro (Thinking) | 93 |
| 2 | GPT-5 | 91.5 |
| 3 | GPT-o3 | 91 |
| 4 | Doubao 1.5 Pro | 90.5 |
| 5 | DeepSeek-R1 | 89.5 |
| 5 | Gemini 2.5 Pro | 89.5 |
| 5 | Qwen 3 (Thinking) | 89.5 |
| 8 | Hunyuan-T1 | 88.5 |
| 8 | Ernie X1-Turbo | 88.5 |
| 10 | Gemini 2.5 Flash | 88 |
| 10 | Grok 3 (Thinking) | 88 |
| 12 | Qwen 3 | 87 |
| 13 | GPT-4.1 | 86 |
| 14 | DeepSeek-V3 | 85 |
| 14 | GPT-o4 mini | 85 |
| 16 | GPT-4o | 84.5 |
| 17 | Hunyuan-TurboS | 83.5 |
| 18 | Claude 4 Opus (Thinking) | 83 |
| 19 | Claude 4 Opus | 82.5 |
| 19 | Grok 3 | 82.5 |
| 19 | Grok 4 | 82.5 |
| 22 | Ernie 4.5-Turbo | 80.5 |
| 23 | MiniMax-01 | 80 |
| 23 | SenseChat V6 Pro | 80 |
| 23 | SenseChat V6 (Thinking) | 80 |
| 26 | Yi-Lightning | 79.5 |
| 27 | GLM-4-plus | 78 |
| 28 | Kimi | 77.5 |
| 28 | Spark 4.0 Ultra | 77.5 |
| 30 | Step 2 | 76.5 |
| 31 | GLM-Z1-Air | 76 |
| 32 | Baichuan4-Turbo | 75.5 |
| 33 | Step R1-V-Mini | 71.5 |
| 34 | 360 Zhinao 2-o1 | 70 |
| 35 | Llama 3.3 70B | 69.5 |
| 36 | Kimi-k1.5 | 69 |

Table 3: Composite Ranking
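
Comparing the three tables shows where the composite scores come from: each Table 3 score equals the average of the model's Table 1 weighted score and its Table 2 final weighted score (again an observation from the published numbers; the report does not spell out the formula). For example, Doubao 1.5 Pro (Thinking) gets (95 + 91) / 2 = 93, and GPT-5 gets (94 + 89) / 2 = 91.5. A minimal check:

```python
# Composite (Table 3) = mean of basic logical inference score (Table 1)
# and contextual reasoning final weighted score (Table 2).
samples = {
    #                          Table 1, Table 2, Table 3 (published)
    "Doubao 1.5 Pro (Thinking)": (95, 91, 93),
    "GPT-5":                     (94, 89, 91.5),
    "GPT-o3":                    (97, 85, 91),
    "Llama 3.3 70B":             (64, 75, 69.5),
}
for model, (basic, contextual, composite) in samples.items():
    assert (basic + contextual) / 2 == composite
    print(f"{model}: ({basic} + {contextual}) / 2 = {composite}")
```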

The complete "Large Language Model Reasoning Capability Evaluation Report" is available on the AIEL website (https://www.hkubs.hku.hk/aimodelrankings_en).

Across these rankings, many LLMs from China performed exceptionally well and are progressing rapidly, demonstrating the distinct advantages and strong potential of China's LLM industry in Chinese-language tasks.
