When researchers are building large language models (LLMs), they aim to maximize performance under a particular computational and financial budget. Since training a single model can cost millions of dollars, developers need to be judicious with cost-impacting decisions about, for instance, the model architecture, optimizers, and training datasets before committing to one. To anticipate the quality and accuracy of a large model's predictions, practitioners often turn to scaling laws: using smaller, cheaper models to approximate the performance of a much larger target model. The challenge, however, is that there are thousands of ways to create a scaling law.
New work from MIT and MIT-IBM Watson AI Lab researchers addresses this by amassing and releasing a collection of hundreds of models, along with metrics on their training and performance, and using it to fit more than a thousand scaling laws. From this, the team developed a meta-analysis and guide for how to select small models and estimate scaling laws for different LLM families, so that the budget is optimally applied toward generating reliable performance predictions.
"The notion that you might want to try to build mathematical models of the training process is a couple of years old, but I think what was new here is that most of the work that people had been doing before is saying, 'can we say something post-hoc about what happened when we trained all of these models, so that when we're trying to figure out how to train a new large-scale model, we can make the best decisions about how to use our compute budget?'" says Jacob Andreas, associate professor in the Department of Electrical Engineering and Computer Science and principal investigator with the MIT-IBM Watson AI Lab.
The research was recently presented at the International Conference on Machine Learning by Andreas, along with MIT-IBM Watson AI Lab researchers Leshem Choshen and Yang Zhang of IBM Research.
Extrapolating performance
No matter how you slice it, developing LLMs is an expensive endeavor: from decision-making regarding the numbers of parameters and tokens, data selection and size, and training techniques to determining output accuracy and tuning to the target applications and tasks. Scaling laws offer a way to forecast model behavior by relating a large model's loss to the performance of smaller, less costly models from the same family, avoiding the need to fully train every candidate. Mainly, the differences between the smaller models are the number of parameters and the number of training tokens. According to Choshen, elucidating scaling laws not only enables better pre-training decisions, but also democratizes the field by allowing researchers without vast resources to understand and build effective scaling laws.
The functional form of scaling laws is relatively simple, incorporating components from the small models that capture the number of parameters and their scaling effect, the number of training tokens and their scaling effect, and the baseline performance for the model family of interest. Together, they help researchers estimate a target large model's performance loss; the smaller the loss, the better the target model's outputs are likely to be.
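The article does not spell out the equation, but a widely used parametric form matching that description is the Chinchilla-style law, sketched below. The default coefficient values are the published Chinchilla fits and serve purely as illustrative assumptions, not numbers from this study:

```python
def scaling_law_loss(n_params, n_tokens,
                     E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style scaling law: loss = E + A/N^alpha + B/D^beta.

    E is the baseline (irreducible) loss for the model family,
    A and alpha capture the scaling effect of parameter count N,
    B and beta capture the scaling effect of training tokens D.
    Default coefficients are the published Chinchilla fits, used
    here only for illustration.
    """
    return E + A / n_params**alpha + B / n_tokens**beta

# A hypothetical 70-billion-parameter model trained on 1.4 trillion tokens:
print(scaling_law_loss(70e9, 1.4e12))  # about 1.94
```

Note how each term behaves as the guideline suggests: adding parameters shrinks the `A/N^alpha` term, adding tokens shrinks the `B/D^beta` term, and the predicted loss can never drop below the family baseline `E`.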
These laws allow research teams to weigh trade-offs efficiently and to test how best to allocate limited resources. They're particularly useful for evaluating scaling of a certain variable, like the number of tokens, and for A/B testing of different pre-training setups.
In general, scaling laws aren't new; however, in the field of AI, they emerged as models grew and costs skyrocketed. "It's like scaling laws just appeared at some point in the field," says Choshen. "They started getting attention, but no one really tested how good they are and what you need to do to make a good scaling law." Further, scaling laws were themselves also a black box, in a sense. "Whenever people have created scaling laws in the past, it has always just been one model, or one model family, and one dataset, and one developer," says Andreas. "There hadn't really been a lot of systematic meta-analysis, as everybody is individually training their own scaling laws. So, [we wanted to know,] are there high-level trends that you see across those things?"
Building better
To investigate this, Choshen, Andreas, and Zhang created a large dataset. They collected LLMs from 40 model families, including Pythia, OPT, OLMO, LLaMA, Bloom, T5-Pile, ModuleFormer mixture-of-experts, GPT, and other families. These included 485 unique, pre-trained models, and where available, data about their training checkpoints, computational cost (FLOPs), training epochs, and the seed, along with 1.9 million performance metrics of loss and downstream tasks. The models differed in their architectures, weights, and so on. Using these models, the researchers fit over 1,000 scaling laws and compared their accuracy across architectures, model sizes, and training regimes, as well as testing how the number of models, the inclusion of intermediate training checkpoints, and partial training affected the predictive power of scaling laws on target models. They used measurements of absolute relative error (ARE): the difference between a scaling law's prediction and the observed loss of a large, trained model, expressed as a fraction of that observed loss. With this, the team compared the scaling laws, and after analysis, distilled practical recommendations for AI practitioners about what makes effective scaling laws.
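Concretely, ARE is a relative error. A minimal sketch, with made-up loss values for illustration:

```python
def absolute_relative_error(predicted_loss, observed_loss):
    """|predicted - observed| / observed, expressed as a fraction."""
    return abs(predicted_loss - observed_loss) / observed_loss

# Hypothetical example: a scaling law predicts a loss of 2.05 for the
# target model, and the fully trained model actually reaches 2.00.
print(f"ARE = {absolute_relative_error(2.05, 2.00):.1%}")  # ARE = 2.5%
```

In this made-up case the 2.5 percent error would sit comfortably under the roughly 4 percent noise floor the guidelines describe.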
Their shared guidelines walk developers through the steps, options, and expectations to consider. First, it's critical to decide on a compute budget and target model accuracy. The team found that 4 percent ARE is about the best achievable accuracy one could expect due to random seed noise, but up to 20 percent ARE is still useful for decision-making. The researchers identified several factors that improve predictions, like including intermediate training checkpoints rather than relying only on final losses; this made scaling laws more reliable. However, data from very early in training, before about 10 billion tokens, are noisy, reduce accuracy, and should be discarded. They recommend prioritizing the training of more models across a spread of sizes, not just larger models, to improve the robustness of the scaling law's predictions; selecting five models provides a solid starting point.
Generally, including larger models improves prediction, but costs can be saved by partially training the target model to about 30 percent of its dataset and using that for extrapolation. If the budget is considerably constrained, developers should consider training one smaller model within the target model family and borrow scaling law parameters from a model family with similar architecture; however, this may not work for encoder-decoder models. Lastly, the MIT-IBM research group found that when scaling laws were compared across model families, there was strong correlation between two sets of hyperparameters, meaning that three of the five hyperparameters explained nearly all of the variation and could likely capture the model behavior. Together, these guidelines provide a systematic approach to making scaling law estimation more efficient, reliable, and accessible for AI researchers working under varying budget constraints.
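A toy version of that workflow — train a handful of small models, fit a law to them, extrapolate to the target — might look like the sketch below. The model sizes and losses are invented, the single-variable form `loss = E + A/N^alpha` is a simplification (the real laws also include a token term), and the coarse grid search stands in for a proper nonlinear least-squares solver:

```python
# Hypothetical (parameter count, observed loss) pairs for five small
# models from one family -- the spread of sizes the guidelines recommend.
observations = [
    (125e6, 3.86), (350e6, 3.37), (760e6, 3.08),
    (1.3e9, 2.92), (2.7e9, 2.74),
]

def fit_power_law(data):
    """Brute-force fit of loss = E + A / N**alpha.

    A real pipeline would use a nonlinear least-squares solver; a
    coarse grid search keeps this sketch dependency-free.
    """
    best, best_err = None, float("inf")
    for E in (e / 100 for e in range(100, 250, 5)):        # baseline loss
        for alpha in (a / 100 for a in range(10, 60, 2)):  # size exponent
            xs = [n ** -alpha for n, _ in data]
            ys = [loss - E for _, loss in data]
            # Closed-form least-squares solution for the single linear
            # coefficient A, given this candidate E and alpha.
            A = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
            err = sum((E + A * x - loss) ** 2
                      for x, (_, loss) in zip(xs, data))
            if err < best_err:
                best, best_err = (E, A, alpha), err
    return best

E, A, alpha = fit_power_law(observations)
# Extrapolate to a hypothetical 70-billion-parameter target model:
predicted = E + A * (70e9) ** -alpha
print(f"fit: E={E}, alpha={alpha}; predicted target loss={predicted:.2f}")
```

The fitted baseline `E` plays the role of the family's irreducible loss, and the extrapolated value is the kind of prediction whose ARE the study measures against a fully trained target.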
Several surprises arose during this work: partially trained small models are still very predictive, and further, the intermediate training stages of a fully trained model can be used (as if they were individual models) to predict another target model. "Basically, you don't pay anything in the training, because you already trained the full model, so the half-trained model, for instance, is just a byproduct of what you did," says Choshen. Another feature Andreas pointed out was that, when aggregated, the variability across model families and different experiments jumped out and was noisier than expected. Unexpectedly, the researchers found that it's possible to use scaling laws fit on large models to predict performance for smaller models. Other research in the field has hypothesized that smaller models were a "different beast" compared to large ones; however, Choshen disagrees. "If they're totally different, they should have shown totally different behavior, and they don't."
While this work focused on model training time, the researchers plan to extend their analysis to model inference. Andreas says it's not, "how does my model get better as I add more training data or more parameters, but instead as I let it think for longer, draw more samples. I think there are definitely lessons to be learned here about how to also build predictive models of how much thinking you need to do at run time." He says the theory of inference time scaling laws might become even more critical because, "it's not like I'm going to train one model and then be done. [Rather,] it's every time a user comes to me, they're going to have a new query, and I need to figure out how hard [my model needs] to think to come up with the best answer. So, being able to build those kinds of predictive models, like we're doing in this paper, is even more important."
This research was supported, in part, by the MIT-IBM Watson AI Lab and a Sloan Research Fellowship.