In high-noise settings—such as industrial sites, military operations, or emergency scenarios—conventional communication tools often fail due to acoustic interference. Existing silent speech interfaces (SSIs), which rely on electroencephalography (EEG), surface electromyography (sEMG), or single-axis strain sensors, face challenges including invasiveness, poor reusability, and an inability to capture the full complexity of speech-related muscle movements. The team led by Prof. Sung-Min Park addresses these gaps with a soft, wearable system that combines multiaxial strain mapping with AI-driven processing.
At the heart of the system is a Computer Vision-Based Optical Strain (CVOS) sensor, integrated into a comfortable neck choker. The sensor features a soft silicone (Ecoflex) substrate embedded with high-contrast black micromarkers on a white background, paired with a tiny camera, compact microscope lens, and LED light source. This design enables high-sensitivity detection of both the magnitude and direction of throat muscle deformations—critical for capturing the complex, multiaxial strain patterns associated with speech.
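The principle of marker-based optical strain sensing can be illustrated with a minimal sketch. This is not the authors' algorithm; the marker coordinates and the simple pair-wise strain formula below are our own hypothetical illustration of how tracked micromarker positions could yield both strain magnitude and direction.

```python
# Illustrative sketch: in-plane strain from tracked micromarker positions.
# Marker coordinates are hypothetical camera-pixel positions, not real data.
import math

def strain_between(p0_ref, p1_ref, p0_def, p1_def):
    """Engineering strain and orientation (degrees) of one marker pair."""
    L0 = math.dist(p0_ref, p1_ref)          # reference length
    L1 = math.dist(p0_def, p1_def)          # deformed length
    strain = (L1 - L0) / L0                 # engineering strain
    angle = math.degrees(math.atan2(p1_def[1] - p0_def[1],
                                    p1_def[0] - p0_def[0]))
    return strain, angle

# Horizontal pair stretched 2%, vertical pair compressed 1% (made-up numbers)
ex, ax = strain_between((0, 0), (100, 0), (0, 0), (102, 0))
ey, ay = strain_between((0, 0), (0, 100), (0, 0), (0, 99))
print(f"x-strain {ex:+.3f} at {ax:.0f} deg, y-strain {ey:+.3f} at {ay:.0f} deg")
```

Tracking many such markers across the field of view is what turns a single camera frame into a 2D strain map rather than a single scalar reading.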
Key advantages of the CVOS sensor include: (1) Unlike single-axis strain sensors, it captures 2D strain maps, preserving directional information essential for distinguishing speech patterns; (2) Boasting a gauge factor of 3,625, low hysteresis (<0.65%), and high linearity (>0.99), it detects ultrafine strains as small as 0.02%; (3) Robust to mechanical degradation, with consistent performance across devices (mean absolute percentage error of 2.8%) and over 10,000 loading-unloading cycles; (4) Unaffected by acoustic noise up to 90 dB, a level comparable to a construction site.
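A back-of-envelope check shows why these figures matter together. Using the textbook definition of gauge factor (relative signal change divided by strain)—our illustration, not a calculation from the paper's methods—the reported sensitivity implies that even the minimum detectable strain produces a large, easily read signal change:

```python
# Back-of-envelope check (illustrative, not from the paper's methods):
# gauge factor GF = (relative signal change) / strain, so the minimum
# detectable strain of 0.02% at GF = 3,625 gives a sizeable signal swing.
GF = 3625
epsilon = 0.0002                       # 0.02% strain
rel_signal_change = GF * epsilon       # fractional change in sensor output
print(f"relative signal change: {rel_signal_change:.2%}")
```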
The sensor's data is processed through an AI-driven pipeline optimized for silent speech decoding: (1) Automatically accounts for initial residual stress from device attachment (e.g., tightness, position) to eliminate baseline drift; (2) Combines convolutional neural networks (CNNs) for spatial feature extraction with transformers for temporal pattern analysis, capturing both local muscle deformations and global speech dynamics; (3) Model size reduced from 12.4 MB to 3.6 MB via knowledge distillation, enabling real-time inference (0.003 seconds per sample) on edge devices like Raspberry Pi 5; (4) Reconstructs the speaker's unique voice using recordings as short as 10 minutes, enhancing communication naturalness.
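The knowledge-distillation step in (3) can be sketched in a few lines. The temperature value and logits below are made up for illustration, and the paper's actual training objective is not reproduced here; this only shows the standard idea of matching a small student model to a large teacher's softened outputs:

```python
# Sketch of knowledge distillation (illustrative; logits and temperature
# are hypothetical, not the paper's training configuration).
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T softens the distribution
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(teacher_logits, student_logits, T=4.0):
    """KL(teacher || student) on temperature-softened class probabilities."""
    p = softmax(teacher_logits, T)   # soft teacher targets
    q = softmax(student_logits, T)   # student predictions
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical logits for 3 of the 26 NATO classes
teacher = [4.0, 1.0, 0.5]
student = [3.0, 1.5, 0.2]
print(f"distillation loss: {distillation_loss(teacher, student):.4f}")
```

Minimizing such a loss lets the 3.6 MB student inherit the 12.4 MB teacher's behavior while staying small enough for edge inference.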
The system focuses on recognizing the NATO phonetic alphabet (26 words, e.g., "Alpha" for "A," "Bravo" for "B"), a standardized framework designed to minimize miscommunication in noisy environments. This constrained vocabulary balances practicality with usability, enabling effective communication without requiring large-scale speech recognition.
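The full 26-word vocabulary is simply the standard letter-to-code-word mapping, which makes it easy to see how recognized words compose into spelled messages (the `spell` helper below is our own illustration):

```python
# The 26-word NATO phonetic vocabulary as a simple lookup table.
NATO = {
    "A": "Alpha", "B": "Bravo", "C": "Charlie", "D": "Delta", "E": "Echo",
    "F": "Foxtrot", "G": "Golf", "H": "Hotel", "I": "India", "J": "Juliett",
    "K": "Kilo", "L": "Lima", "M": "Mike", "N": "November", "O": "Oscar",
    "P": "Papa", "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
    "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "X-ray", "Y": "Yankee",
    "Z": "Zulu",
}

def spell(word):
    """Spell a word with NATO code words, as one might over a noisy link."""
    return " ".join(NATO[c] for c in word.upper())

print(spell("SOS"))  # Sierra Oscar Sierra
```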
The researchers validated the system across multiple scenarios, including controlled laboratory tests and real-world noisy environments: (1) Achieved 85.8% recognition accuracy for the 26 NATO phonetic words, with 82% accuracy retained in the lightweight model; (2) Using Low-Rank Adaptation (LoRA) fine-tuning, the system reached 80% accuracy with just 20 samples per class from new users, outperforming traditional fine-tuning (76%); (3) Maintained reliable performance in 90-dB white noise and during gas blowback rifle firing (irregular noise + mechanical vibrations), successfully transmitting words like "Romeo" and "Tango" in real time; (4) Performed consistently across varying device tightness levels and vocal intensities, with highest accuracy (100%) at moderate tightness and vocal effort.
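The efficiency behind the LoRA result in (2) comes from replacing a full weight update with two small low-rank factors. The dimensions below are made up for illustration (the paper's layer sizes are not given here), but they show why adapting to a new user needs so few samples and parameters:

```python
# Sketch of the Low-Rank Adaptation (LoRA) parameter saving: instead of
# retraining a full d x d weight matrix, learn factors B (d x r) and
# A (r x d) with rank r << d. Dimensions below are hypothetical.
d, r = 512, 4
full_params = d * d            # parameters in a full-rank update
lora_params = d * r + r * d    # parameters in the low-rank update B @ A
print(f"full update: {full_params} params, LoRA update: {lora_params} params "
      f"({lora_params / full_params:.1%} of full)")
```

With far fewer trainable parameters, 20 samples per class is enough signal to personalize the model without overfitting.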
A key demonstration showed the system working seamlessly while a user fired a rifle, with the decoded speech transmitted wirelessly to another room and synthesized into clear audio. The sensor's high signal-to-noise ratio (34 dB)—far exceeding commercial sEMG systems—ensured signal clarity even amid vibrations.
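To put the 34 dB figure in perspective, the standard decibel conversion (a textbook formula; the signal values here are hypothetical) shows the amplitude ratio it corresponds to:

```python
# Textbook SNR-to-amplitude-ratio conversion (illustrative values only).
import math

def snr_db(signal_rms, noise_rms):
    """Signal-to-noise ratio in decibels from RMS amplitudes."""
    return 20 * math.log10(signal_rms / noise_rms)

# A 34 dB SNR means the signal amplitude is roughly 50x the noise floor
ratio = 10 ** (34 / 20)
print(f"34 dB -> amplitude ratio ~{ratio:.0f}:1")
print(f"check: {snr_db(ratio, 1.0):.1f} dB")
```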
The CVOS-based SSI addresses unmet needs in multiple high-stakes environments: (1) Enables communication between workers in loud factories or construction sites; (2) Offers an alternative communication tool for patients with laryngectomy or voice disorders, avoiding the invasiveness of traditional EEG or sEMG systems.
Future research will focus on expanding the vocabulary beyond the NATO alphabet, enhancing motion artifact resistance (e.g., integrating inertial measurement units), and optimizing the device's ergonomics for long-term wear. The team also plans to validate the system with larger and more diverse user cohorts to strengthen generalizability.
This innovative SSI represents a significant leap forward in noise-robust communication, merging soft wearable sensing with advanced AI to overcome the limitations of conventional speech interfaces. "By capturing the full complexity of throat muscle movements and adapting to real-world variability, our system enables reliable silent speech in environments where traditional microphones fail," noted Prof. Park. The technology's combination of performance, durability, and practicality positions it as a transformative tool for high-noise communication across industrial, military, and clinical domains.
Authors of the paper include Sunguk Hong, Junyoung Yoo, and Sung-Min Park.
This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Korea (RS-2024-00412658 and RS-2025-00517742); by the Bio&Medical Technology Development Program of the NRF, funded by the Ministry of Science and ICT (MSIT), Korea (RS-2024-00361688); and by the Pioneer Research Center Program through the NRF, funded by the Ministry of Science, ICT and Future Planning, Korea (2022M3C1A3081294).
The paper, "Soft Multiaxial Strain Mapping Interface with AI-Driven Decoding for Silent Speech in Noise," was published in the journal Cyborg and Bionic Systems on Mar. 23, 2026, at DOI: 10.34133/cbsystems.0536.