Earbuds with Cameras Enable AI-Powered Visual Chat

Two black earbuds: one with the casing removed exposing a computer chip and tiny camera.
UW researchers developed a system called VueBuds that uses tiny cameras in off-the-shelf wireless earbuds to allow users to talk with an AI model about the scene in front of them. Here, the altered headphones are shown with the camera inserted. Photo: Kim et al./CHI '26

University of Washington researchers developed the first system that incorporates tiny cameras in off-the-shelf wireless earbuds to allow users to talk with an AI model about the scene in front of them. For instance, a user might turn to a Korean food package and say, "Hey Vue, translate this for me." They'd then hear an AI voice say, "The visible text translates to 'Cold Noodles' in English."

The prototype system, called VueBuds, takes low-resolution, black-and-white images, which it transmits over Bluetooth to a phone or other nearby device. A small artificial intelligence model on the device then answers questions about the images in about a second. For privacy, all of the processing happens on the device, a small light turns on when the system is recording, and users can immediately delete images.

The team will present its research April 14 at the Association for Computing Machinery Conference on Human Factors in Computing Systems in Barcelona.

"We haven't seen most people adopt smart glasses or VR headsets, in part because a lot of people don't like wearing glasses, and they often come with privacy concerns, such as recording high-resolution video and processing it in the cloud," said senior author Shyam Gollakota, a UW professor in the Paul G. Allen School of Computer Science & Engineering. "But almost everyone wears earbuds already, so we wanted to see if we could put visual intelligence into tiny, low-power earbuds, and also address privacy concerns in the process."

Cameras use far more power than the microphones already in earbuds, so using the same sort of high-res cameras as those in smart glasses wouldn't work. Also, large amounts of information can't stream continuously over Bluetooth, so the system can't run continuous video.

The team found that using a low-power camera, roughly the size of a grain of rice, to shoot low-resolution, black-and-white still images limited battery drain and allowed for Bluetooth transmission while preserving performance.
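To give a rough sense of why still, low-resolution grayscale frames are so much easier to send than continuous high-resolution video, the back-of-envelope sketch below compares uncompressed data rates. The article does not report resolutions, bit depths or frame rates, so every number here is a hypothetical placeholder chosen only for illustration, and real systems would also apply compression.

```python
def frame_bytes(width, height, bits_per_pixel):
    """Size of one uncompressed frame in bytes."""
    return width * height * bits_per_pixel // 8


def required_kbps(width, height, bits_per_pixel, frames_per_second):
    """Uncompressed data rate in kilobits per second."""
    return frame_bytes(width, height, bits_per_pixel) * 8 * frames_per_second / 1000


# Hypothetical low-resolution grayscale still: 320x240, 8 bits per pixel, 1 frame per second.
low_res_still = required_kbps(320, 240, 8, 1)

# Hypothetical high-resolution color video: 1920x1080, 24 bits per pixel, 30 frames per second.
high_res_video = required_kbps(1920, 1080, 24, 30)

print(f"low-res still:  {low_res_still:,.1f} kbps")
print(f"high-res video: {high_res_video:,.1f} kbps")
print(f"ratio: {high_res_video / low_res_still:,.0f}x")
```

Under these placeholder values the video stream needs a few thousand times more raw bandwidth than the occasional grayscale still, which is the kind of gap the paragraph above describes.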

There was also the matter of placement.

"One big question we had was: Will your face obscure the view too much? Can earbud cameras capture the user's view of the world reliably?" said lead author Maruchi Kim, who completed this work as a UW doctoral student in the Allen School.

The team found that angling each camera 5 to 10 degrees outward provides a 98- to 108-degree field of view. While this creates a small blind spot when objects are held closer than 20 centimeters from the user, people rarely hold things that close to examine them, so it is a non-issue for typical interactions.
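The reported angles fit a simple additive model, sketched below: if two identical cameras with overlapping views are each tilted outward by some angle, the combined horizontal field of view is the per-camera field of view plus twice the tilt. The per-camera value of 88 degrees used here is not stated in the article; it is inferred by working backward from the reported range, and the model ignores occlusion by the face, so it says nothing about the 20-centimeter blind spot.

```python
def combined_fov(per_camera_fov_deg, outward_tilt_deg):
    """Combined horizontal field of view of two cameras tilted outward.

    Assumes both cameras share the same field of view and that their
    views still overlap in the middle (tilt smaller than half the FOV).
    """
    if outward_tilt_deg >= per_camera_fov_deg / 2:
        raise ValueError("views no longer overlap; the additive model does not apply")
    return per_camera_fov_deg + 2 * outward_tilt_deg


# Assumed per-camera field of view, inferred from the reported range.
ASSUMED_CAMERA_FOV_DEG = 88

for tilt in (5, 10):
    print(f"tilt {tilt} deg -> combined FOV {combined_fov(ASSUMED_CAMERA_FOV_DEG, tilt)} deg")
```

With the assumed 88-degree cameras, tilts of 5 and 10 degrees give 98 and 108 degrees, matching the range the team reported.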

Researchers also discovered that while the vision language model was largely able to make sense of the images from each earbud, having to process images from both earbuds slowed it down. So they had the system "stitch" the two images into one, identifying overlapping imagery and combining it. This allows the system to respond in one second, quick enough to feel like real time for users, rather than the two seconds it takes with separate images.
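The general idea of stitching can be illustrated with a toy example. The sketch below is not the researchers' method, which the article does not detail; it assumes two grayscale images of equal height whose overlap is a purely horizontal band, finds the overlap width with the lowest average pixel difference, and joins the images there. The function names `find_overlap` and `stitch` are hypothetical.

```python
def find_overlap(left, right, min_overlap=1):
    """Return the overlap width (in columns) that best matches the right
    edge of `left` to the left edge of `right`.

    Images are lists of rows of grayscale values with equal height.
    """
    max_width = min(len(left[0]), len(right[0]))
    best_width, best_error = 0, float("inf")
    for width in range(min_overlap, max_width + 1):
        total = 0
        for left_row, right_row in zip(left, right):
            for a, b in zip(left_row[-width:], right_row[:width]):
                total += abs(a - b)
        error = total / (width * len(left))  # mean absolute difference
        if error < best_error:
            best_width, best_error = width, error
    return best_width


def stitch(left, right):
    """Join two images into one, keeping the overlapping columns once."""
    width = find_overlap(left, right)
    return [left_row + right_row[width:] for left_row, right_row in zip(left, right)]


# Two tiny 2x4 "images" that share their middle two columns.
left_image = [[1, 2, 3, 4], [5, 6, 7, 8]]
right_image = [[3, 4, 9, 10], [7, 8, 11, 12]]

print(stitch(left_image, right_image))
```

A practical version would need to handle the different viewing angles of the two cameras and guard against coincidental matches on very narrow overlaps, for example by raising `min_overlap`.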

The team then had 74 participants compare recorded outputs from VueBuds with outputs from Ray-Ban Meta Glasses in a series of tests. Although VueBuds uses low-resolution images with greater privacy controls and the Ray-Bans take high-res images processed in the cloud, the two systems performed equivalently. Participants preferred VueBuds' translations, while the Ray-Bans did better at counting objects.
