New In-Depth Report on AI Large Language Models: Hallucination Control

Figure: Hallucination Control Capability by Tiers

HKU Business School today released the "Large Language Model (LLM) Hallucination Control Capability Evaluation Report," which evaluates how well selected AI LLMs control "hallucinations": outputs that appear plausible but contradict facts or deviate from the given context. LLMs are increasingly used in professional domains such as knowledge services, intelligent navigation, and customer service, yet hallucinations continue to limit their credibility.

This study was carried out by the Artificial Intelligence Evaluation Laboratory (https://www.hkubs.hku.hk/aimodelrankings_en), led by Professor Jack JIANG, Padma and Hari Harilela Professor in Strategic Information Management at HKU Business School. The research team conducted specialised assessments of the hallucination control capabilities of 37 LLMs, including 20 general-purpose models, 15 reasoning models, and 2 unified systems. The study aimed to reveal how effectively different models avoid factual errors and maintain contextual consistency.

The evaluation results show that GPT-5 (Thinking) and GPT-5 (Auto) ranked first and second, respectively, with the Claude 4 Opus series following closely behind. Among Chinese models, ByteDance's Doubao 1.5 Pro series performed well, but it still trailed the leading international LLMs by a significant margin.

Professor JIANG said, "Hallucination control capability, as a core metric for evaluating the truthfulness and reliability of model outputs, directly impacts the credibility of LLMs in professional settings. This research provides clear direction for future model optimisation and advancing AI systems from simply being 'capable of generating' outputs to being more reliable."

Evaluation Methodology

Based on whether problems in LLM-generated content concern factual accuracy or contextual consistency, the study categorises hallucinations into two types:

  • Factual Hallucinations: When a model's output conflicts with real-world information, including incorrect recall of known knowledge (e.g., misattribution or misremembered data) or fabrication of unverifiable information (e.g., invented events or figures). The assessment detected factual hallucinations through information retrieval questions, false-fact identification, and contradictory-premise identification tasks.
  • Faithful Hallucinations: When a model fails to strictly follow user instructions or produces content that contradicts the input context, including omitting key requirements, over-extending the task, or violating format constraints. The evaluation used instruction consistency and contextual consistency assessments (a simplified, illustrative version of both checks is sketched after this list).
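To make the two dimensions concrete, the sketch below is a minimal, hypothetical version of such checks in Python: a factual check that compares an answer against verified reference facts, and a faithfulness check that verifies required instruction elements and a format constraint. The data and function names are illustrative assumptions, not the Laboratory's actual test harness or scoring method.

```python
# Toy illustration only: assumed data and checks, not the actual evaluation harness.

def factual_check(model_answer: str, reference_facts: set) -> bool:
    """Return True if the answer matches a verified reference fact,
    i.e. no factual hallucination is detected for this item."""
    return model_answer.strip() in reference_facts

def faithful_check(model_answer: str, required_keywords: list, max_words: int) -> bool:
    """Return True if the answer follows the instruction: every required
    element is present and the length (format) constraint is respected."""
    has_keywords = all(k.lower() in model_answer.lower() for k in required_keywords)
    within_limit = len(model_answer.split()) <= max_words
    return has_keywords and within_limit

# Hypothetical test items.
facts = {"Ottawa"}                                    # verified answer to "What is the capital of Canada?"
print(factual_check("Toronto", facts))                # False -> counted as a factual hallucination
print(faithful_check("Summary: sales rose 5% in Q2.",
                     required_keywords=["sales", "Q2"],
                     max_words=12))                   # True -> instruction followed faithfully
```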

Hallucination Control Performance and Rankings

From the study results, GPT-5 (Thinking) and GPT-5 (Auto) ranked first and second, respectively, with the Claude 4 Opus series close behind. The Doubao 1.5 Pro series from ByteDance performed best among the Chinese LLMs, showing balanced scores in factual and faithful hallucination control. However, its overall capability still lagged behind top international models such as GPT-5 and the Claude series.

| Rank | Model Name | Factual Hallucination | Faithful Hallucination | Final Score |
| --- | --- | --- | --- | --- |
| 1 | GPT-5 (Thinking) | 72 | 100 | 86 |
| 2 | GPT-5 (Auto) | 68 | 100 | 84 |
| 3 | Claude 4 Opus (Thinking) | 73 | 92 | 83 |
| 4 | Claude 4 Opus | 64 | 96 | 80 |
| 5 | Grok 4 | 71 | 80 | 76 |
| 6 | GPT-o3 | 49 | 100 | 75 |
| 7 | Doubao 1.5 Pro | 57 | 88 | 73 |
| 8 | Doubao 1.5 Pro (Thinking) | 60 | 84 | 72 |
| 9 | Gemini 2.5 Pro | 57 | 84 | 71 |
| 10 | GPT-o4 mini | 44 | 96 | 70 |
| 11 | GPT-4.1 | 59 | 80 | 69 |
| 12 | GPT-4o | 53 | 80 | 67 |
| 12 | Gemini 2.5 Flash | 49 | 84 | 67 |
| 14 | ERNIE X1-Turbo | 47 | 84 | 65 |
| 14 | Qwen 3 (Thinking) | 55 | 76 | 65 |
| 14 | DeepSeek-V3 | 49 | 80 | 65 |
| 14 | Hunyuan-T1 | 49 | 80 | 65 |
| 18 | Kimi | 47 | 80 | 63 |
| 18 | Qwen 3 | 51 | 76 | 63 |
| 20 | DeepSeek-R1 | 52 | 68 | 60 |
| 20 | Grok 3 | 36 | 84 | 60 |
| 20 | Hunyuan-TurboS | 44 | 76 | 60 |
| 23 | SenseChat V6 Pro | 41 | 76 | 59 |
| 24 | GLM-4-plus | 35 | 80 | 57 |
| 25 | MiniMax-01 | 31 | 80 | 55 |
| 25 | 360 Zhinao 2-o1 | 49 | 60 | 55 |
| 27 | Yi-Lightning | 28 | 80 | 54 |
| 28 | Grok 3 (Thinking) | 29 | 76 | 53 |
| 29 | Kimi-k1.5 | 36 | 68 | 52 |
| 30 | ERNIE 4.5-Turbo | 31 | 72 | 51 |
| 30 | SenseChat V6 (Thinking) | 37 | 64 | 51 |
| 32 | Step 2 | 32 | 68 | 50 |
| 33 | Step R1-V-Mini | 36 | 60 | 48 |
| 34 | Baichuan4-Turbo | 33 | 60 | 47 |
| 35 | GLM-Z1-Air | 32 | 60 | 46 |
| 36 | Llama 3.3 70B | 33 | 56 | 45 |
| 37 | Spark 4.0 Ultra | 19 | 64 | 41 |

Table 1: Ranking of Hallucination Control Capability
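The Report's exact aggregation formula is not reproduced in this summary, but every published final score sits within half a point of the simple average of its two sub-scores, which suggests roughly equal weighting of factual and faithful control. A quick check of that (assumed) equal weighting on the top five rows:

```python
# Arithmetic check on the top five rows of Table 1, assuming (not stated here)
# that the final score weights the two sub-scores equally.
top_five = [
    ("GPT-5 (Thinking)",         72, 100, 86),
    ("GPT-5 (Auto)",             68, 100, 84),
    ("Claude 4 Opus (Thinking)", 73,  92, 83),
    ("Claude 4 Opus",            64,  96, 80),
    ("Grok 4",                   71,  80, 76),
]
for name, factual, faithful, published in top_five:
    average = (factual + faithful) / 2
    # Each equal-weight average lands within 0.5 of the published final score.
    print(f"{name}: average {average:.1f} vs published {published}")
```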

The scores and rankings across the 37 models reveal significant differences, with distinct performance characteristics in controlling factual versus faithful hallucinations. Overall, current large models showed strong control over faithful hallucinations but still struggled with factual inaccuracies. This indicates that models tend to follow instructions faithfully yet remain prone to fabricating facts.

Furthermore, reasoning models such as Qwen 3 (Thinking), ERNIE X1-Turbo, and Claude 4 Opus (Thinking) were better at avoiding hallucinations than general-purpose LLMs. Among Chinese models, Doubao 1.5 Pro performed best, with balanced control of both factual and faithful hallucinations, though it still trailed the GPT-5 and Claude series in overall capability. In contrast, the DeepSeek series showed relatively weak hallucination control and has room for improvement.

The complete "Large Language Model Hallucination Control Capability Evaluation Report" can be viewed online.

Moving forward, trustworthy AI will require balanced improvement in controlling both factual and faithful hallucinations, so that models produce more reliable content.
