Large language models (LLMs) such as ChatGPT and Gemini were originally designed to work with text only. Today, they have evolved into multimodal systems that can work with many types of information at once, understanding and generating images, audio, speech and music.
The most common way to add speech to multimodal models is to convert it into small building blocks called audio tokens, which function for audio much like characters do for text. However, audio tokens still carry a lot of information, which makes speech harder to handle than text. Despite recent progress, integrating speech into large language models remains a major challenge.
"Speech is an extremely rich and complex signal," says Luca Della Libera, PhD student at the Gina Cody School of Engineering and Computer Science. "Beyond the words we say, it carries information about our emotions, accent, identity and many other cues.
"Because of this complexity, standard audio tokens often have a high bitrate (the amount of information packed into each second of audio). They pack a huge amount of information into each second of audio, which makes it difficult for large language models to learn from speech efficiently."
A focus on speech's meaning
Della Libera and his collaborators developed FocalCodec, a new audio tokenization method that compresses speech far more efficiently than previous approaches. It preserves both the sound and meaning of words at an ultra-low bitrate.
Instead of relying on heavy processing steps, the system uses a simple way of turning audio into compact units (binary spherical quantization) and a technique that helps the model focus on the most meaningful parts of speech (focal modulation). This makes the analysis faster and keeps the essential qualities of the voice intact.
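For readers curious about the first ingredient, the sketch below illustrates the general idea behind binary spherical quantization: project a latent vector onto the unit sphere and keep only the sign of each coordinate, so every dimension contributes one bit of the token. This is a toy NumPy illustration of the technique under those assumptions, not the authors' FocalCodec code, and it omits training details such as the straight-through gradient estimator.

```python
# Minimal sketch of binary spherical quantization (BSQ); illustrative only.
import numpy as np

def binary_spherical_quantize(latent: np.ndarray) -> tuple[np.ndarray, int]:
    """Quantize a latent vector to the nearest point on the unit sphere whose
    coordinates are all +/- 1/sqrt(d), and return the matching integer token.

    Each of the d dimensions contributes one bit (its sign), so a d-dimensional
    latent maps to one of 2**d possible tokens.
    """
    d = latent.shape[-1]
    unit = latent / (np.linalg.norm(latent, axis=-1, keepdims=True) + 1e-8)  # project onto the sphere
    bits = (unit > 0).astype(np.int64)                   # sign of each coordinate -> one bit
    quantized = (2 * bits - 1) / np.sqrt(d)              # nearest binary corner, rescaled to the sphere
    token_id = int((bits * (2 ** np.arange(d))).sum())   # pack the bits into a single integer token
    return quantized, token_id

# Example: a 13-dimensional latent yields one of 2**13 = 8192 possible tokens.
rng = np.random.default_rng(0)
code, token = binary_spherical_quantize(rng.normal(size=13))
print(token, code.round(3))
```

Because the "codebook" here is just the set of corners of a binary hypercube, nothing has to be learned or searched at quantization time, which is part of what makes this style of quantization so lightweight.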
To test FocalCodec, the team conducted a listening study with 33 participants who compared different audio samples. Participants often judged the reconstructed speech as nearly identical to the original recordings. This shows that the system can shrink speech significantly without making it sound robotic or distorted.
Recognized at a top AI conference
The work has been accepted at the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), one of the most selective conferences in machine learning and artificial intelligence.
"This work is particularly important, as it introduces a novel approach that can be highly valuable for building modern multimodal LLMs," says Mirco Ravanelli, assistant professor and Della Libera's supervisor. "By making speech lighter and easier to integrate, we move closer to AI systems that understand sound with the same confidence they bring to text."
The work reflects ongoing collaboration between Concordia and Mila - Quebec Artificial Intelligence Institute.
The paper also includes contributions from Francesco Paissan, visiting researcher at Mila and undergraduate student at the University of Trento, and Cem Subakan, affiliate assistant professor at Concordia.
Learn more about Concordia's Gina Cody School of Engineering and Computer Science