Just as people from different countries speak different languages, AI models also create various internal "languages" – a unique set of tokens understood only by each model. Until recently, there was no way for models developed by different companies to communicate directly, collaborate or combine their strengths to improve performance. This week, at the International Conference on Machine Learning (ICML) in Vancouver, Canada, scientists from the Weizmann Institute of Science and Intel Labs are presenting a new set of algorithms that overcome this barrier, enabling users to benefit from the combined computational power of AI models working together. The new algorithms, already available to millions of AI developers around the world, speed up the performance of large language models (LLMs) – today's leading models of generative AI – by 1.5 times, on average.
LLMs, such as ChatGPT and Gemini, are powerful tools, but they come with significant drawbacks: They are slow and consume large amounts of computing power. In 2022, major tech companies realized that AI models, like people, could benefit from collaboration and division of labor. This led to the development of a method called speculative decoding, in which a small, fast model with relatively limited knowledge drafts a first guess at the answer to a user's query, and a larger, more powerful but slower model reviews and corrects it if needed. Speculative decoding was quickly adopted by tech giants because it maintains 100-percent accuracy – unlike most acceleration techniques, which reduce output quality. But it had one big limitation: Both models had to "speak" the exact same digital language, which meant that models developed by different companies could not be combined.
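The draft-and-verify loop at the heart of speculative decoding can be illustrated with a toy sketch. This is not the researchers' implementation: the two "models" below are stand-in functions over integer tokens (the target model simply counts upward; the draft model agrees except that it guesses wrong whenever the next token is a multiple of 5), chosen only to show why the output always matches what the large model alone would produce.

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_tokens=12):
    """Toy speculative decoding: the fast draft model proposes k tokens,
    the slow target model verifies them, keeps the longest prefix it
    agrees with, and contributes one token of its own each round."""
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens:
        # 1. Draft phase: the small model guesses k tokens ahead.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. Verify phase: the large model checks each guess in turn.
        accepted = []
        for tok in draft:
            if target_next(out + accepted) == tok:
                accepted.append(tok)  # guess matches: keep it for free
            else:
                # Guess is wrong: the target supplies the correct token
                # and the rest of the draft is discarded.
                accepted.append(target_next(out + accepted))
                break
        else:
            # All k guesses accepted; the target adds one bonus token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):][:max_tokens]

# Deterministic stand-ins for the two models.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 1 if (seq[-1] + 1) % 5 else seq[-1] + 2

print(speculative_decode(draft, target, [0], k=5, max_tokens=10))
# → [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Because every accepted token is one the target model itself would have produced, the result is identical to running the large model alone – the draft model only lets many of those tokens be verified in parallel instead of generated one by one, which is where the speedup comes from.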
"Tech giants adopted speculative decoding, benefiting from faster performance and saving billions of dollars a year in processing costs, but they were the only ones to have access to small, faster models that speak the same language as larger models," explains Nadav Timor, a PhD student in Prof. David Harel's research team in Weizmann's Computer Science and Applied Mathematics Department, who led the new development. "In contrast, a startup seeking to benefit from speculative decoding had to train its own small model that matched the language of the big one, and that takes a great deal of expertise and costly computational resources."
The new algorithms developed by Weizmann and Intel researchers allow developers to pair any small model with any large model, so that the two work as a team. To overcome the language barrier, the researchers came up with two solutions.
First, they designed an algorithm that allows an LLM to translate its output from its internal token language into a shared format that all models can understand. Second, they created another algorithm that encourages the models, in their collaborative work, to rely mainly on tokens that have the same meaning across models – much as words like "banana" or "internet" are nearly identical across human languages.
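The first idea – translating between private token languages via a shared format – amounts, in its simplest form, to detokenizing one model's IDs into plain text and retokenizing that text in the other model's vocabulary. The sketch below uses two made-up miniature vocabularies purely for illustration; the published algorithms operate on real tokenizers and the models' probability distributions, not on toy lookup tables like these.

```python
# Two toy vocabularies: each "model" assigns its own IDs to (sub)words,
# and splits words differently (model A splits "internet", model B
# splits "banana"). All names here are illustrative.
VOCAB_A = {0: "the ", 1: "banana ", 2: "inter", 3: "net "}
VOCAB_B = {"the ": 9, "ba": 4, "nana ": 7, "internet ": 2}

def to_text(ids, vocab):
    """Detokenize a model's private IDs into the shared format: plain text."""
    return "".join(vocab[i] for i in ids)

def to_ids(text, vocab):
    """Greedy longest-match retokenization into another model's vocabulary."""
    ids, pos = [], 0
    while pos < len(text):
        match = max((t for t in vocab if text.startswith(t, pos)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"untranslatable text at position {pos}")
        ids.append(vocab[match])
        pos += len(match)
    return ids

text = to_text([0, 1, 0, 2, 3], VOCAB_A)
print(text)                    # → the banana the internet 
print(to_ids(text, VOCAB_B))   # → [9, 4, 7, 9, 2]
```

Note how "the " survives the round trip with the same surface form in both vocabularies – an instance of the shared tokens the second algorithm favors – while "banana" and "internet" must be re-split, which is exactly where information can be lost in translation.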
"At first, we worried that too much information would be 'lost in translation' and that different models wouldn't be able to collaborate effectively," says Timor. "But we were wrong. Our algorithms speed up the performance of LLMs by up to 2.8 times, leading to massive savings in processing costs."
The significance of this research has been recognized by ICML organizers, who selected the study for public presentation – a distinction granted to only about 1 percent of the 15,000 submissions received this year. "We have solved a core inefficiency in generative AI," says Oren Pereg, a senior researcher at Intel Labs and co-author of the study. "This isn't just a theoretical improvement; these are practical tools that are already helping developers build faster and smarter applications."
In the past several months, the team released their algorithms on the open-source AI platform Hugging Face Transformers, making them freely available to developers around the world. The algorithms have since become part of standard tools for running efficient AI processes.
"This new development is especially important for edge devices, from phones and drones to autonomous cars, which must rely on limited computing power when not connected to the internet," Timor adds. "Imagine, for example, a self-driving car that is guided by an AI model. In this case, a faster model can make the difference between a safe decision and a dangerous error."
Also participating in the study were Dr. Jonathan Mamou, Daniel Korat, Moshe Berchansky and Moshe Wasserblat from Intel Labs and Gaurav Jain from d-Matrix.
Prof. David Harel is the incumbent of the William Sussman Professorial Chair of Mathematics.