Machine learning models are designed to take in data, find patterns or relationships within those data, and use what they have learned to make predictions or to create new content. The quality of those outputs depends not only on the details of a model's inner workings but also, crucially, on the information that is fed into the model. Some models follow a brute-force approach, essentially feeding every bit of data related to a particular problem into the model and seeing what comes out. But a sleeker, less energy-hungry way to approach a problem is to determine which variables are vital to the outcome and to provide the model with information about only those key variables.
Now, Adrián Lozano-Durán, an associate professor of aerospace at Caltech and a visiting professor at MIT, and MIT graduate student Yuan Yuan have developed a theorem that takes any number of possible variables and whittles them down, leaving only those that are most important. In the process, the method removes all units, such as meters and feet, from the underlying equations, making them dimensionless, something scientists require of equations that describe the physical world. The work can be applied not only to machine learning but to any mathematical model.
"The theorem we derived will tell you, even for a collection of inputs that have dimensions, how to construct dimensionless inputs that contain the maximum amount of information about what you want to predict," Lozano-Durán says. "It will also tell you the percent error on the best possible prediction you can make with that information."
Lozano-Durán and Yuan describe their new method in a paper that appears in the journal Nature Communications.
For many physical models, Lozano-Durán says, it is possible to have a collection of thousands or even millions of variables that are related in some way to a prediction you would like to make. But not all of those variables will be equally useful in making a prediction. Consider the problem of predicting tomorrow's temperature in Pasadena. A model for this problem could include thousands of variables, from measurements of barometric pressure and wind speed at multiple times and locations, to ocean-buoy readings of temperatures above the sea, to satellite measurements of water vapor. Now, say you also decided to include the driver's license numbers of every driver in California. Those numbers represent a lot of additional data for the model to consider, but they are not relevant to the question of what tomorrow's weather will be.
The example may seem silly, but it makes clear what the new method aims to do: Strip away the variables that do not contain information that will help a model make the best possible prediction. "Why are all the license numbers not useful to predict temperature tomorrow? Because there is no information about temperature contained there. And that's the key," Lozano-Durán says. "When we add input variables, there is hidden information there about what you want to predict, and that information is what you need to extract. The quality of your prediction is related to how much information your input has about your output."
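In standard information-theoretic terms, that shared information is the mutual information between an input X and an output Y. As a point of reference (this is the textbook definition, not an equation quoted from the paper):

\[
I(X;Y) = H(Y) - H(Y \mid X)
\]

where H(Y) is the uncertainty in the output on its own and H(Y|X) is the uncertainty that remains once the input is known; the more an input reveals about the output, the smaller that second term becomes.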
Lozano-Durán and Yuan call their new method IT-π, where IT stands for information theory, on which the method is built. For any given variable, the method calculates how much information about the output can be obtained from that input. IT-π pictures the relationship between input and output as a Venn diagram, where the input is one circle and the output another. The method seeks to determine how much those circles overlap for each variable. If there is no overlap, there is no prediction. If they completely overlap, the input fully predicts the output. The method combines variables in different ways and measures the overlap for each of those scenarios, eventually homing in on the combination with the highest possible overlap. "When we cannot increase the overlap anymore, the method has found the best possible variables," Lozano-Durán explains.
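To make the overlap idea concrete, here is a minimal sketch in Python (using only NumPy) of how one might rank candidate inputs by an estimate of their mutual information with a target. The toy variables, the candidate ratios, and the histogram-based estimator are all invented for illustration; this is not the authors' IT-π implementation.

```python
# Illustrative only: score candidate inputs by how much information they share
# with the target, then keep the most informative one.
import numpy as np

rng = np.random.default_rng(0)

def mutual_information(x, y, bins=20):
    """Crude histogram-based estimate of I(X;Y) in bits for 1-D arrays x and y."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

# Toy dimensional variables (hypothetical units): a speed, a length, a viscosity.
speed = rng.uniform(1.0, 10.0, 5000)
length = rng.uniform(0.1, 1.0, 5000)
viscosity = rng.uniform(1e-3, 1e-2, 5000)

# Pretend the quantity we want to predict depends on one dimensionless
# combination of the inputs (a Reynolds-number-like ratio), plus noise.
target = np.tanh(1e-3 * speed * length / viscosity) + 0.05 * rng.normal(size=5000)

# Candidate inputs built as ratios and products of the raw variables.
candidates = {
    "speed*length/viscosity": speed * length / viscosity,
    "speed/length": speed / length,
    "length**2/viscosity": length**2 / viscosity,
}

scores = {name: mutual_information(np.log(x), target) for name, x in candidates.items()}
print(scores)
print("most informative candidate:", max(scores, key=scores.get))
```

In this toy setting, the ratio that actually drives the target shares the most information with it and is the one the search would keep; the real method works with many variables at once and searches over how best to combine them.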
In the new paper, Lozano-Durán and Yuan use the new method to make a variety of predictions. In one example, they wanted to determine which inputs to feed into a neural network used to calculate the heat flux (the rate at which heat flows into a space capsule's surface) as it entered the Martian atmosphere. The researchers had access to data for 20 different variables that could be included, such as velocity and temperature at different locations. In the end, their analysis showed that they needed only two variables, constructed as ratios of characteristic fluxes, such as heat and mass, or of other physical quantities such as energies and timescales. Those variables capture the relative importance of competing processes.
No Units Needed
Importantly, Lozano-Durán notes, the variables that come out of his new theorem are dimensionless. A fundamental mathematical concept in physics called the Buckingham π theorem says that when you construct a model of the real world, you should be able to rewrite all of its equations in a form where none of the variables depend upon the units of measurement that are used. Edgar Buckingham, the early 20th-century American physicist for whom the theorem is named, provided a formalized way of transforming equations to arrive at such dimensionless parameters.
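A classic textbook illustration of the theorem (not an example taken from the new paper) is the drag force on a sphere moving through a fluid. Written dimensionally, the problem involves five quantities: the force F, the sphere's diameter d, its speed U, and the fluid's density ρ and viscosity μ. Because three base units (mass, length, and time) are involved, Buckingham's result says the relationship collapses to 5 - 3 = 2 dimensionless groups, for instance a drag coefficient and a Reynolds number:

\[
\frac{F}{\rho U^{2} d^{2}} = g\!\left(\frac{\rho U d}{\mu}\right)
\]

Both sides keep the same numerical value whether the measurements are made in metric or imperial units.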
"All the equations in physics need to follow this property. If you change the units, the equation remains the same," says Lozano-Durán. For example, the gravitational force between Earth and the sun must not depend on whether the distance between the two bodies is measured in miles or kilometers, or whether the masses are measured in pounds or kilograms. "If you see an equation where you change the units and the equation is different, there's something wrong."
Returning to Lozano-Durán's Martian-spacecraft example, the Buckingham π theorem says that seven variables, rather than two, are needed to determine the heat flux. According to Lozano-Durán, a researcher would need to perform some 2,000 experiments to gather the required data for those seven variables and create a very simple model of heat flux. "According to our results, you need only nine experiments," he says. The tool also told the researchers that by completing those nine experiments, they could have 92 percent certainty that they had arrived at the correct heat-flux prediction.
The IT-π method can save time, energy, and money, he says. In particular, the technique could cut down on the amount of data needed to train AI models, allowing them to be trained faster while using less electricity. "Today, especially in these machine learning models that are kind of a big black box, it's very important to make sure that everything you give them is meaningful," he says. "The more variables you have as input, the more training data you need. So, you want to have the minimum number of variables while not affecting the performance of your model."
The paper is titled "Dimensionless learning based on information." The work was supported by funding from the National Science Foundation, an Early Career Faculty grant from NASA's Space Technology Research Grants Program, and the MIT International Science and Technology Initiatives Global Seed Fund.