Interpreting the fine print of a chest CT report can make or break a patient's surgical plan, yet radiologists worldwide face ballooning workloads and widening expertise gaps. A new study from Zhujiang Hospital of Southern Medical University analyzed 13,489 real-world chest CT reports and found that state-of-the-art LLMs can shoulder much of that burden—when asked the right way.
"We discovered that modern language models can act as a dependable second set of eyes for radiologists," said Dr. Peng Luo, lead author and physician at Zhujiang Hospital. "With carefully worded multiple-choice prompts, GPT-4 reached a 75 percent accuracy rate across 13 common chest diseases, ranging from COPD to aortic atherosclerosis."
The team compared GPT-4, Claude-3.5-Sonnet, Qwen-Max, Gemini-Pro and GPT-3.5-Turbo using two question styles: open-ended and multiple choice. Across all models, multiple-choice prompts boosted accuracy and consistency, underscoring the power of prompt engineering. GPT-4, Claude-3.5 and Qwen-Max topped the charts, while GPT-3.5-Turbo and Gemini-Pro lagged.
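The study's exact prompt wording is not reproduced in this article, but the multiple-choice approach can be sketched in a few lines. The snippet below, which assumes the OpenAI Python client and uses a hypothetical placeholder for the report text and answer options, illustrates the general pattern rather than the authors' actual prompts:

```python
# Minimal sketch of a multiple-choice prompt for a chest CT report.
# Illustrative only: the study's prompt wording, options, and settings are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

report_text = "<de-identified chest CT report text>"  # hypothetical placeholder

prompt = (
    "Based on the following chest CT report, does the patient show evidence of COPD?\n\n"
    f"Report:\n{report_text}\n\n"
    "Choose exactly one option:\n"
    "A. Yes\n"
    "B. No\n"
    "C. Cannot be determined from the report\n"
    "Answer with the letter only."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic decoding helps answer consistency across runs
)
print(response.choices[0].message.content)
```

Constraining the model to a fixed set of answer letters is what makes outputs easy to score automatically and less prone to rambling, which is consistent with the accuracy and consistency gains the team reports for multiple-choice prompts.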
To probe whether weaker models could catch up, the researchers fine-tuned GPT-3.5-Turbo on 200 high-performing cases. "Fine-tuning turned a 42 percent system into a 65 percent system overnight for tough pulmonary cases," Dr. Luo said. "That's a game-changer for hospitals that rely on cost-effective models."
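The paper's fine-tuning configuration is not detailed here, but the standard recipe for fine-tuning GPT-3.5-Turbo through the OpenAI API is to package curated question-answer pairs as chat-format JSONL and submit a fine-tuning job. The sketch below is a hedged illustration under that assumption; the list of cases, file paths, and question wording are all hypothetical stand-ins for the study's 200 training examples:

```python
# Illustrative sketch of preparing and launching a GPT-3.5-Turbo fine-tuning job.
# "cases" is a hypothetical list of (report, question, answer) tuples, not the study's data.
import json
from openai import OpenAI

cases = [
    ("<de-identified report text>", "Does this report indicate pleural effusion?\nA. Yes\nB. No", "A"),
    # ... remaining curated cases
]

with open("train.jsonl", "w") as f:
    for report, question, answer in cases:
        example = {
            "messages": [
                {"role": "user", "content": f"{question}\n\nReport:\n{report}"},
                {"role": "assistant", "content": answer},
            ]
        }
        f.write(json.dumps(example) + "\n")

client = OpenAI()
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")
print(job.id)  # poll this job until it completes, then query the resulting model
```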
Beyond raw accuracy, the study evaluated each model's area under the ROC curve (AUC) for every disease. GPT-4 excelled at gallstone and pleural effusion detection, while Qwen-Max showed unusual strength in COPD discrimination. However, no single model dominated every condition, suggesting a tailored, disease-specific deployment strategy.
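For readers unfamiliar with the metric, per-disease AUC is computed by comparing a model's score for each report against the radiologist-confirmed label; a value of 1.0 means perfect discrimination and 0.5 means chance. A small scikit-learn example, using invented labels and scores rather than the study's data, shows the calculation:

```python
# Per-disease AUC from binary ground-truth labels and model confidence scores.
# Values are invented for illustration; they are not the study's results.
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                      # 1 = disease present per reference standard
y_score = [0.92, 0.30, 0.75, 0.64, 0.41, 0.12, 0.88, 0.55]  # model-derived probability of disease

print(f"AUC: {roc_auc_score(y_true, y_score):.3f}")
```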
The authors caution that LLM outputs still require expert oversight, especially when a model expresses high confidence in borderline cases. Future work will integrate explainable-AI tools to reveal how models weigh radiologic clues and to set dynamic confidence thresholds.