ChatGPT and other generative AI models have made notable progress in natural language processing and generation and show great potential in the medical field, for example in automatically generating medical exam questions and answers, acting as personalized learning assistants, supporting course design, and aiding medical imaging analysis. These models are also expected to play a pivotal role in training biosafety laboratory researchers by providing interactive learning experiences.
In this study, a dataset of 62 text-based and 8 image-based biosafety questions was collected from leading medical schools, HKU, and the US CDC. For text-based questions, Gemini Pro, Claude-3, Claude-2, GPT-4, and GPT-3.5 were evaluated, while Gemini Pro Vision and GPT-4V were used for image-based questions. Each model generated three responses per question, and metrics such as Reference Answer Accuracy Rate (RAAR), Subjective Answer Accuracy Rate (SAAR), and Strict Accuracy Rate (SAR) were used to analyze performance.
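To illustrate how such per-question metrics could be computed from three responses per question, the following is a minimal sketch. The exact definitions of RAAR, SAAR, and SAR are not spelled out here, so the formulas below are assumptions: RAAR as the share of responses matching the reference answer, SAAR as the share of responses judged correct by reviewers, and SAR as the share of questions answered correctly on all three attempts.

```python
from dataclasses import dataclass

@dataclass
class QuestionResult:
    matches_reference: list[bool]  # per-response match against the reference answer
    judged_correct: list[bool]     # per-response subjective (reviewer) judgement

def raar(results: list[QuestionResult]) -> float:
    # Assumed RAAR: fraction of all responses that match the reference answer.
    flags = [m for r in results for m in r.matches_reference]
    return sum(flags) / len(flags)

def saar(results: list[QuestionResult]) -> float:
    # Assumed SAAR: fraction of all responses judged correct by human reviewers.
    flags = [j for r in results for j in r.judged_correct]
    return sum(flags) / len(flags)

def sar(results: list[QuestionResult]) -> float:
    # Assumed SAR: fraction of questions with all three responses matching the reference.
    return sum(all(r.matches_reference) for r in results) / len(results)

# Two hypothetical questions, three responses each, for illustration only.
demo = [
    QuestionResult([True, True, False], [True, True, True]),
    QuestionResult([True, True, True], [True, True, True]),
]
print(f"RAAR={raar(demo):.1%}  SAAR={saar(demo):.1%}  SAR={sar(demo):.1%}")
```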
On text-based questions, all models performed well: Gemini Pro reached a RAAR of 79.4%, followed by Claude-3 (78.7%), Claude-2 (76.5%), GPT-4 (75.7%), and GPT-3.5 (70.3%). On image-based questions, GPT-4V outperformed Gemini Pro Vision, with RAARs of 78.7% and 76.5%, respectively. Beyond question answering, multimodal AI models such as GPT-4 and Gemini can also enable real-time laboratory monitoring, anomaly detection, and predictive maintenance, and can improve biosafety training and education through automatic generation of customized materials.
However, generative AI faces limitations such as bias, errors, incomplete results, lack of high-quality training data for rare events, inadequate real-time processing, and ethical concerns like privacy and transparency. Addressing these challenges requires uncertainty markers, automated bias detection, human-AI collaborative verification, standardized and simulated datasets, federated learning, explainable AI, and robust accountability mechanisms.