How AI Support Can Go Wrong in Safety-Critical Settings

When it comes to adopting artificial intelligence in high-stakes settings like hospitals and airplanes, good AI performance and brief worker training on the technology are not sufficient to ensure systems will run smoothly and patients and passengers will be safe, a new study suggests.

Instead, algorithms and the people who use them in the most safety-critical organizations must be evaluated simultaneously to get an accurate view of AI's effects on human decision making, researchers say.

The team also contends these evaluations should assess how people respond to good, mediocre and poor technology performance to put the AI-human interaction to a meaningful test - and to expose the level of risk linked to mistakes.

Participants in the study, led by engineering researchers at The Ohio State University, were 450 Ohio State nursing students, mostly undergraduates with varying amounts of clinical training, and 12 licensed nurses. They used AI-assisted technologies in a remote patient-monitoring scenario to judge how likely each patient in a range of cases was to need urgent care.

Results showed that more accurate AI predictions about whether a patient was trending toward a medical emergency improved participant performance by 50% to 60%. But when the algorithm produced an inaccurate prediction, even one accompanied by explanatory data that didn't support that prediction, human performance collapsed, with proper decision making degrading by more than 100% when the algorithm was most wrong.


"An AI algorithm can never be perfect. So if you want an AI algorithm that's ready for safety-critical systems, that means something about the team, about the people and AI together, has to be able to cope with a poor-performing AI algorithm," said first author Dane Morey, a research scientist in the Department of Integrated Systems Engineering at Ohio State.

"The point is this is not about making really good safety-critical system technology. It's the joint human-machine capabilities that matter in a safety-critical system."

Morey completed the study with Mike Rayo, associate professor, and David Woods, faculty emeritus, both in integrated systems engineering at Ohio State. The research was published recently in npj Digital Medicine.

The authors, all members of the Cognitive Systems Engineering Lab directed by Rayo, developed the Joint Activity Testing research program in 2020 to address what they see as a gap in responsible AI deployment in risky environments, especially medical and defense settings.


The team is also refining a set of evidence-based guiding principles for machine design with joint activity in mind that can smooth the AI-human performance evaluation process and, after that, actually improve system outcomes.

According to their preliminary list, a machine should first and foremost convey to people the ways in which it is misaligned with the world, even when it is unaware of that misalignment.

"Even if a technology does well on those heuristics, it probably still isn't quite ready," Rayo said. "We need to do some form of empirical evaluation because those are risk-mitigation steps, and our safety-critical industries deserve at least those two steps of measuring performance of people and AI together and examining a range of challenging cases."

The Cognitive Systems Engineering Lab has been running studies for five years on real technologies to arrive at best-practice evaluation methods, mostly on projects with 20 to 30 participants. Having 462 participants in this project - especially a target population for AI-infused technologies whose study enrollment was connected to a course-based educational activity - gives the researchers high confidence in their findings and recommendations, Rayo said.

Each participant analyzed a sequence of 10 patient cases under differing experimental conditions: no AI help, an AI percentage prediction of imminent need for emergency care, AI annotations of data relevant to the patient's condition, and both AI predictions and annotations.
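For illustration only (this structure is not taken from the study's published materials), the four display conditions described above amount to toggling two features on or off, roughly as in this hypothetical Python sketch:

```python
# Hypothetical sketch of the four experimental display conditions described
# in the article: no AI help, prediction only, annotations only, or both.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class DisplayCondition:
    show_prediction: bool   # AI percentage prediction of imminent need for emergency care
    show_annotations: bool  # AI annotations of data relevant to the patient's condition

# Enumerate all four conditions; (False, False) is the no-AI baseline.
CONDITIONS = [DisplayCondition(p, a) for p, a in product([False, True], repeat=2)]
```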

All examples included a data visualization showing demographics, vital signs and lab results intended to help users anticipate changes to or stability in a patient's status.

Participants were instructed to report their concern for each patient on a scale from 0 to 10. Higher concern for emergency patients and lower concern for non-emergency patients were taken as indicators of better performance.
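The article does not give the researchers' exact scoring formula, but as a rough illustration, performance under this setup could be summarized by checking whether concern moves in the right direction for each case, along these hypothetical lines:

```python
# Hypothetical scoring sketch (not the study's actual metric): a rating is
# "better" when concern is high for emergency cases and low for stable ones.
def concern_score(concern: float, is_emergency: bool) -> float:
    """Map a 0-10 concern rating to a 0-1 score.

    Emergency cases reward high concern; non-emergency cases reward low concern.
    """
    assert 0 <= concern <= 10
    return concern / 10 if is_emergency else 1 - concern / 10

# Example: a participant rates an emergency case 8/10 and a stable case 2/10.
cases = [(8, True), (2, False)]
avg = sum(concern_score(c, e) for c, e in cases) / len(cases)
print(f"Average score: {avg:.2f}")  # 0.80
```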

"We found neither the nurses nor the AI algorithm were universally superior to the other in all cases," the authors wrote. The analysis accounted for differences in participants' clinical experience.

While the overall results provided evidence of the need for this type of evaluation, the researchers said they were surprised that the explanations included in some experimental conditions had very little sway over participants' concern - instead, the algorithm's recommendation, presented as a solid red bar, overruled everything else.

"Whatever effect that those annotations had was roundly overwhelmed by the presence of that indicator that swept everything else away," Rayo said.

The team presented the study methods, including custom-built technologies representative of health care applications currently in use, as a demonstration of why their recommendations are needed and a template for how industries could put the suggested practices in place.

The code for the experimental technologies is publicly available, and Morey, Rayo and Woods further explained their work in an article published at AI-frontiers.org.

"What we're advocating for is a way to help people better understand the variety of effects that may come about from technologies," Morey said. "Basically, the goal is not the best AI performance. It's the best team performance."

This research was funded by the American Nurses Foundation Reimagining Nursing Initiative.
