Researchers at the University of Pennsylvania have launched Observer, the first multimodal medical dataset to capture anonymized, real-time interactions between patients and clinicians. Much like the medical drama The Pitt, which portrays life in the emergency room, Observer lets outsiders peer inside primary care clinics — only, in this case, none of the filmed interactions are fictional.
Until now, the data available to health care researchers has been limited to traces left behind after a visit: qualitative information like clinician notes and quantitative measurements like patient vital signs. None of these sources captures subtleties like body language and vocal tone, or the environmental factors, including computer use, that affect how providers and patients engage with one another.
"So much of what shapes medical visits and their outcomes has been invisible to researchers," says Kevin B. Johnson , David L. Cohen University Professor and the lead author of a new paper describing Observer in the Journal of the American Medical Informatics Association . "Thanks to technology that anonymizes our recordings, enabling HIPAA compliance, Observer lets us watch care unfold. That kind of evidence isn't just the foundation for improving clinical practice, it's crucial for developing responsible AI tools to augment care."
Already, the researchers have awarded pilot grants to other teams to begin using Observer, with the goal of expanding the dataset into a national resource for improving health care. "These early projects are the start of a flywheel," says Johnson. "As researchers generate new insights and recordings, the dataset will grow, letting us ask even more ambitious questions."
Why Clinical Data Matters
For decades, researchers have leveraged data about medical visits to study how to improve health care. The Medical Information Mart for Intensive Care (MIMIC), an MIT-affiliated project begun in the 1990s, now contains tens of thousands of records of ICU visits and has been cited in thousands of research papers covering topics like clinical decision making and hospital operations.
More recently, such data has also played a key role in AI training, since it allows AI models to identify patterns connecting diagnoses, treatments and outcomes across large patient populations. "We've learned a tremendous amount from what gets documented in the medical record," Johnson says. "But if we want to understand the full experience of care, we need data that shows what happens in the room."
With Observer linking video, audio and transcripts to clinical data and electronic health records (EHR), researchers can now ask new questions: when laughter appears during a visit and whether it affects outcomes; how often clinicians look at patients versus their computer screens; how room layout or digital scribing technology changes communication; and how patients respond to explanations of diagnoses.
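To make that linkage concrete, the sketch below shows one way a single encounter could bundle these modalities in code. It is a minimal, hypothetical illustration in Python; the class and field names (EncounterRecord, TranscriptTurn, and so on) are assumptions made for this example, not Observer's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TranscriptTurn:
    # One utterance in the visit transcript, time-aligned to the recordings.
    speaker: str       # e.g. "clinician" or "patient"
    start_sec: float
    end_sec: float
    text: str

@dataclass
class EncounterRecord:
    # One de-identified visit, linking media files, transcript, and structured EHR data.
    encounter_id: str
    room_video_path: str               # fixed room camera
    clinician_video_path: str          # clinician's head-mounted camera
    patient_video_path: Optional[str]  # patient-mounted camera, only if the patient opted in
    audio_path: str
    transcript: list[TranscriptTurn] = field(default_factory=list)
    ehr_summary: dict = field(default_factory=dict)  # linked EHR fields such as diagnoses and vitals

def laughter_turns(record: EncounterRecord) -> list[TranscriptTurn]:
    # Example analysis hook: find transcript turns annotated with laughter, so a study
    # could relate when laughter appears during a visit to downstream outcomes.
    return [turn for turn in record.transcript if "[laughter]" in turn.text.lower()]
```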
"This kind of multimodal evidence — combining video, audio and medical records — creates opportunities across so many fields," says Karen O'Connor , Associate Director of Johnson's Artificial Intelligence for Ambulatory Care Innovation (AI-4-AI) Lab. "By making this data available, we're democratizing medical research and opening new paths to improving care."
Ensuring Patient Privacy
In the United States, patient health information is protected by the Health Insurance Portability and Accountability Act (HIPAA), which requires that any data used for research be stripped of identifying details.
For video and audio, that standard has historically been almost impossible to meet. Until recently, creating a dataset of real clinical encounters would have required manually reviewing and editing every second of footage and sound, a labor-intensive and error-prone process.
Enter MedVidDeID, a tool the Penn researchers developed to automatically anonymize video and audio recordings from clinical settings, which they describe in a separate paper in the Journal of Biomedical Informatics. In tests, MedVidDeID successfully de-identified more than 90% of video frames without human intervention and reduced total review time by over 60%.
The multi-stage system extracts transcripts, removes identifying text, scrubs audio, transforms voices, and automatically detects and blurs faces and other visual identifiers using state-of-the-art computer-vision models. A human reviewer performs final quality control to ensure total removal of protected health information.
"We built a modular pipeline that automates most of the audio-video de-identification process. By keeping a human in the loop, we're able to protect patient privacy while enabling video-informed research at scale," says Sriharsha Mopidevi , Senior Application Developer in the AI-4-AI Lab and co-author of both papers.
Before collecting data, the researchers ensured that patients, patients' families and clinicians had the opportunity to opt in and later provide feedback on the process. With consent in place, the team deployed multiple cameras in participating clinics: a fixed room camera to capture the overall visit, a head-mounted camera worn by the clinician to show their perspective, and — when participants opted in — a patient-mounted camera to record the visit from the patient's point of view.
Future Directions
With the first phase of data collection complete and pilot studies underway, the Observer team is preparing to expand the dataset and make it available to a wider research community. The team plans to adopt an access model similar to the one used by MIMIC, allowing qualified investigators to apply for permission to use the multimodal recordings for their own studies.
"This is ultimately about changing the health care system," Johnson says. "You cannot improve care or build meaningful clinical AI without understanding the encounter itself. When you can see what happens across hundreds or thousands of visits, transformation becomes possible."
This work was supported by the National Library of Medicine and the NIH Office of the Director under project number 5DP1LM014558-03 (Former Number: 1DP1OD035237-01) for the project entitled "Helping Doctors Doctor: Using AI to Automate Documentation and 'De-Autonomate' Health Care."
Kevin Johnson, M.D., M.S., is the David L. Cohen University Professor of Biomedical Informatics, Computer and Information Science, Pediatrics, and Science Communication at the University of Pennsylvania.
Additional co-authors include Basam Alasaly, Kuk Jin Jang, Eric Eaton and Ross Koppel, all of the University of Pennsylvania.