HKU Business School today released the "Large Language Model (LLM) Hallucination Control Capability Evaluation Report." The Report evaluates selected LLMs on their ability to control "hallucinations", that is, outputs that appear plausible but contradict facts or deviate from the given context. LLMs are increasingly used in professional domains such as knowledge services, intelligent navigation, and customer service, yet hallucinations continue to limit their credibility.
This study was carried out by the Artificial Intelligence Evaluation Laboratory (https://www.hkubs.hku.hk/aimodelrankings_en), led by Professor Jack JIANG, Padma and Hari Harilela Professor in Strategic Information Management at HKU Business School. The research team conducted specialised assessments of the hallucination control capabilities of 37 LLMs, including 20 general-purpose models, 15 reasoning models, and 2 unified systems. The study aimed to reveal how effectively different models avoid factual errors and maintain contextual consistency.
The evaluation results show that GPT-5 (Thinking) and GPT-5 (Auto) ranked first and second, respectively, with the Claude 4 Opus series following closely behind. Among Chinese models, ByteDance's Doubao 1.5 Pro series performed very well but still showed significant gaps relative to the leading international LLMs.
Professor JIANG said, "Hallucination control capability, as a core metric for evaluating the truthfulness and reliability of model outputs, directly impacts the credibility of LLMs in professional settings. This research provides clear direction for future model optimisation and advancing AI systems from simply being 'capable of generating' outputs to being more reliable."
Evaluation Methodology
Based on whether problems in LLM-generated content concern factual accuracy or contextual consistency, the study categorises hallucinations into two types:
- Factual Hallucinations: the model's output conflicts with real-world information, whether through incorrect recall of known knowledge (e.g., misattributions or misremembered data) or fabrication of unverifiable information (e.g., invented events or data). These were detected through information-retrieval questions, false-fact identification, and contradictory-premise identification tasks.
- Faithful Hallucinations: the model fails to strictly follow user instructions or produces content that contradicts the input context, such as omitting key requirements, adding unrequested extensions, or making formatting errors. These were assessed through instruction-consistency and context-consistency tests (a simplified scoring sketch follows this list).
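As a rough illustration of how these two categories can translate into scores, the sketch below grades a small set of hypothetical test items, factual items against reference answers and faithful items against instruction constraints, and reports a per-category pass rate. This is a minimal, hypothetical sketch of the idea only; the item formats, grading rules, and function names are our assumptions, not the Lab's actual evaluation protocol.

```python
# Minimal, hypothetical sketch of two-category hallucination scoring.
# Item formats, grading rules, and names are illustrative assumptions,
# not the evaluation protocol used in the HKU report.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Item:
    prompt: str
    grader: Callable[[str], bool]  # True if the answer is hallucination-free
    category: str                  # "factual" or "faithful"


def score(items: list[Item], answer_fn: Callable[[str], str]) -> dict[str, float]:
    """Score a model (answer_fn) on each category as the percentage of items passed."""
    results: dict[str, list[bool]] = {}
    for item in items:
        passed = item.grader(answer_fn(item.prompt))
        results.setdefault(item.category, []).append(passed)
    return {cat: 100 * sum(r) / len(r) for cat, r in results.items()}


# Toy items: one factual check against a reference answer,
# one faithful check that an instruction constraint was followed.
items = [
    Item("Who wrote 'Pride and Prejudice'?",
         lambda a: "austen" in a.lower(), "factual"),
    Item("List exactly three colours, one per line.",
         lambda a: len(a.strip().splitlines()) == 3, "faithful"),
]

if __name__ == "__main__":
    mock_model = lambda p: "Jane Austen" if "Prejudice" in p else "red\nblue\ngreen"
    print(score(items, mock_model))  # e.g. {'factual': 100.0, 'faithful': 100.0}
```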
Hallucination Control Performance and Rankings
From the study results, GPT-5 (Thinking) and GPT-5 (Auto) ranked first and second, respectively, with the Claude 4 Opus series close behind. ByteDance's Doubao 1.5 Pro series performed best among the Chinese LLMs, with balanced scores in factual and faithful hallucination control, but its overall capability still lagged behind top international models such as GPT-5 and the Claude series.
| Rank | Model Name | Factual Hallucination | Faithful Hallucination | Final Score |
|------|------------|-----------------------|------------------------|-------------|
| 1 | GPT-5 (Thinking) | 72 | 100 | 86 |
| 2 | GPT-5 (Auto) | 68 | 100 | 84 |
| 3 | Claude 4 Opus (Thinking) | 73 | 92 | 83 |
| 4 | Claude 4 Opus | 64 | 96 | 80 |
| 5 | Grok 4 | 71 | 80 | 76 |
| 6 | GPT-o3 | 49 | 100 | 75 |
| 7 | Doubao 1.5 Pro | 57 | 88 | 73 |
| 8 | Doubao 1.5 Pro (Thinking) | 60 | 84 | 72 |
| 9 | Gemini 2.5 Pro | 57 | 84 | 71 |
| 10 | GPT-o4 mini | 44 | 96 | 70 |
| 11 | GPT-4.1 | 59 | 80 | 69 |
| 12 | GPT-4o | 53 | 80 | 67 |
| 12 | Gemini 2.5 Flash | 49 | 84 | 67 |
| 14 | ERNIE X1-Turbo | 47 | 84 | 65 |
| 14 | Qwen 3 (Thinking) | 55 | 76 | 65 |
| 14 | DeepSeek-V3 | 49 | 80 | 65 |
| 14 | Hunyuan-T1 | 49 | 80 | 65 |
| 18 | Kimi | 47 | 80 | 63 |
| 18 | Qwen 3 | 51 | 76 | 63 |
| 20 | DeepSeek-R1 | 52 | 68 | 60 |
| 20 | Grok 3 | 36 | 84 | 60 |
| 20 | Hunyuan-TurboS | 44 | 76 | 60 |
| 23 | SenseChat V6 Pro | 41 | 76 | 59 |
| 24 | GLM-4-plus | 35 | 80 | 57 |
| 25 | MiniMax-01 | 31 | 80 | 55 |
| 25 | 360 Zhinao 2-o1 | 49 | 60 | 55 |
| 27 | Yi-Lightning | 28 | 80 | 54 |
| 28 | Grok 3 (Thinking) | 29 | 76 | 53 |
| 29 | Kimi-k1.5 | 36 | 68 | 52 |
| 30 | ERNIE 4.5-Turbo | 31 | 72 | 51 |
| 30 | SenseChat V6 (Thinking) | 37 | 64 | 51 |
| 32 | Step 2 | 32 | 68 | 50 |
| 33 | Step R1-V-Mini | 36 | 60 | 48 |
| 34 | Baichuan4-Turbo | 33 | 60 | 47 |
| 35 | GLM-Z1-Air | 32 | 60 | 46 |
| 36 | Llama 3.3 70B | 33 | 56 | 45 |
| 37 | Spark 4.0 Ultra | 19 | 64 | 41 |
Table 1: Ranking of Hallucination Control Capability
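One observation from the published numbers: each model's Final Score is close to the simple mean of its two sub-scores, which suggests the two categories are weighted roughly equally; the occasional one-point differences presumably come from rounding of the displayed sub-scores. The short check below spot-verifies this against a few rows of the table; the equal weighting is an inference from the data, not a formula stated in the report.

```python
# Spot-check: Final Score is approximately mean(Factual, Faithful) for published rows.
# Equal weighting is inferred from the table values, not stated in the report.
rows = [
    ("GPT-5 (Thinking)", 72, 100, 86),
    ("Claude 4 Opus", 64, 96, 80),
    ("Doubao 1.5 Pro", 57, 88, 73),
    ("Spark 4.0 Ultra", 19, 64, 41),
]
for name, factual, faithful, final in rows:
    mean = (factual + faithful) / 2
    print(f"{name}: mean={mean:.1f}, published final={final}, diff={final - mean:+.1f}")
```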
The scores and rankings across the 37 models reveal significant differences, with distinct performance profiles in controlling factual versus faithful hallucinations. Overall, current large models showed strong control over faithful hallucinations but still struggled with factual inaccuracies: they tend to follow instructions strictly yet still fabricate facts.
Furthermore, reasoning models such as Qwen 3 (Thinking), ERNIE X1-Turbo, and Claude 4 Opus (Thinking) were better at avoiding hallucinations than their general-purpose counterparts. Among the Chinese models, Doubao 1.5 Pro performed best, with balanced control of both factual and faithful hallucinations, though it still trailed the GPT-5 and Claude series in overall capability. In contrast, the DeepSeek series delivered relatively weak hallucination control and has room for improvement.
Click here to view the complete "Large Language Model Hallucination Control Capability Evaluation Report."
Moving forward, trustworthy AI will require balanced improvements in controlling both factual and faithful hallucinations, so that models produce more reliable content.