Advances in Multimodal XR Interaction for Headsets

Higher Education Press

Researchers have conducted a comprehensive review of recent advances in multimodal natural interaction techniques for Extended Reality (XR) headsets, revealing significant trends in spatial computing technologies. This timely review analyzes how recent breakthroughs in artificial intelligence (AI) and large language models (LLMs) are transforming how users interact with virtual environments, offering valuable insights for the development of more natural, efficient, and immersive XR experiences.

A research team led by Feng Lu systematically reviewed 104 papers published since 2022 in six top venues. Their review article was published on 15 December 2025 in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.

With the widespread adoption of Extended Reality headsets like Microsoft HoloLens 2, Meta Quest 3, and Apple Vision Pro, spatial computing technologies are gaining increasing attention. Natural human-computer interaction is at the core of spatial computing, enabling users to interact with virtual elements through intuitive methods such as eye tracking, hand gestures, and voice commands.

The review classifies interactions based on application scenarios, operation types, and interaction modalities. Operation types are divided into seven categories, distinguishing between active interactions (where users input information) and passive interactions (where users receive feedback). Interaction modalities are explored across nine distinct types, ranging from unimodal interactions (gesture, gaze, speech, or tactile only) to various multimodal combinations.
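As a rough illustration of how such a classification could be represented in software, the sketch below models reviewed techniques with enumerations for operation type and modality. This is a hypothetical illustration, not the paper's own code: the enum members only cover the categories named in this article (a subset of the seven operation types and nine modality types), and names such as InteractionTechnique are assumptions.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import FrozenSet, Optional


class Modality(Enum):
    """Unimodal channels mentioned in the review; combinations form multimodal types."""
    GESTURE = auto()
    GAZE = auto()
    SPEECH = auto()
    TACTILE = auto()


class OperationType(Enum):
    """Operation types named in the article (illustrative subset of the seven)."""
    POINTING_AND_SELECTION = auto()
    LOCOMOTION = auto()
    VIEWPORT_CONTROL = auto()
    TYPING = auto()
    QUERYING = auto()


@dataclass(frozen=True)
class InteractionTechnique:
    """One reviewed technique, tagged by scenario, operation, and modalities."""
    application_scenario: str         # e.g. "object manipulation" (hypothetical label)
    operation: OperationType
    modalities: FrozenSet[Modality]   # one element = unimodal, several = multimodal
    is_active: bool                   # True: user inputs information; False: user receives feedback

    @property
    def is_multimodal(self) -> bool:
        return len(self.modalities) > 1


# Example entry (hypothetical): a gaze + gesture pointing technique.
example = InteractionTechnique(
    application_scenario="object manipulation",
    operation=OperationType.POINTING_AND_SELECTION,
    modalities=frozenset({Modality.GAZE, Modality.GESTURE}),
    is_active=True,
)
assert example.is_multimodal
```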

Statistical analysis of the reviewed literature reveals significant trends. Hand gesture and eye gaze interactions, including their combined modalities, remain the most prevalent. However, there was a notable increase in speech-related studies in 2024, likely driven by recent advancements in LLMs. Among operation types, pointing and selection remains the most studied, although the number of such studies has decreased each year, possibly because this research area has matured. Conversely, research on locomotion, viewport control, typing, and querying has increased, reflecting growing attention to users' subjective experiences and the integration of LLMs.

The researchers also identified several challenges in current natural interaction techniques. For example, gesture-only interactions often require users to adapt to complex paradigms, which increases cognitive load. Eye gaze interactions are prone to the "Midas touch" problem, where users unintentionally select items they are merely looking at. Speech-based interactions struggle with latency and recognition accuracy.
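One widely used mitigation for the Midas touch problem is dwell-time gating, in which a gaze target is selected only after the eyes rest on it for a minimum duration. The following is a minimal sketch of that idea, not a technique from the reviewed paper; the GazeSample structure and the 0.6-second threshold are illustrative assumptions.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class GazeSample:
    """A single eye-tracking sample (hypothetical format)."""
    target_id: Optional[str]   # ID of the object under the gaze ray, or None
    timestamp: float           # seconds


class DwellTimeSelector:
    """Fires a selection only after an uninterrupted dwell on one target,
    reducing accidental 'Midas touch' selections."""

    def __init__(self, dwell_threshold_s: float = 0.6):
        self.dwell_threshold_s = dwell_threshold_s  # assumed value; tune per task
        self._current_target: Optional[str] = None
        self._dwell_start: float = 0.0

    def update(self, sample: GazeSample) -> Optional[str]:
        """Feed one gaze sample; return a target ID when a selection fires."""
        if sample.target_id != self._current_target:
            # Gaze moved to a new target (or empty space): restart the dwell timer.
            self._current_target = sample.target_id
            self._dwell_start = sample.timestamp
            return None
        if (self._current_target is not None
                and sample.timestamp - self._dwell_start >= self.dwell_threshold_s):
            selected = self._current_target
            self._current_target = None  # require re-fixation before the next selection
            return selected
        return None


# Usage: a simulated gaze stream that fixates on "button_play" long enough to select it.
selector = DwellTimeSelector(dwell_threshold_s=0.6)
t0 = time.time()
for dt in (0.0, 0.2, 0.4, 0.7):
    result = selector.update(GazeSample(target_id="button_play", timestamp=t0 + dt))
    if result:
        print(f"Selected: {result}")
```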

Based on these findings, the research team suggests potential directions for future research, including:

  1. Developing more accurate and reliable natural interactions through multimodal integration and error recovery mechanisms
  2. Enhancing the naturalness, comfort, and immersion of XR interactions by reducing physical and cognitive load
  3. Leveraging AI and LLMs to enable more sophisticated, context-aware interactions
  4. Bridging interaction design and practical XR applications to encourage wider adoption

The paper includes detailed illustrations of various interaction techniques, such as gesture-based drawing, gaze vergence control, and LLM-based speech interactions, providing a valuable reference for researchers and practitioners in the field.

This review offers important insights for researchers designing natural and efficient interaction systems for XR, ultimately contributing to the advancement of spatial computing technologies that could transform how we interact with digital information in our daily lives.
