An international research team led by Assistant Professor Zhiyu Wan of ShanghaiTech University has published findings in the journal Health Data Science that reveal biases in how multimodal large language models (LLMs) such as ChatGPT-4 and LLaVA diagnose skin diseases from medical images. The study systematically evaluated these AI models across different sex and age groups.
Using approximately 10,000 dermatoscopic images, the researchers focused on three common skin diseases: melanoma, melanocytic nevi, and benign keratosis-like lesions. The results showed that ChatGPT-4 and LLaVA outperformed most traditional deep learning models overall. ChatGPT-4 also showed greater fairness across demographic groups, whereas LLaVA exhibited significant sex-related biases.
Dr. Wan emphasized, "While large language models like ChatGPT-4 and LLaVA demonstrate clear potential in dermatology, we must address the observed biases, particularly across sex and age groups, to ensure these technologies are safe and effective for all patients."
The team plans further research that incorporates additional demographic variables, such as skin tone, to evaluate the fairness and reliability of AI models in clinical scenarios more comprehensively. This research provides critical guidance for developing more equitable and trustworthy medical AI systems.