In 2008, Pietro Perona, Caltech's Allen E. Puckett Professor of Electrical Engineering, was on sabbatical in Italy, enjoying a cappuccino in a piazza when he casually took note of a pigeon pecking at some fallen crumbs. He noticed the fleshy bit at the base of the bird's beak and wondered what purpose it served and what it was called. A first look at anatomical diagrams on the Columbidae (the pigeon family) Wikipedia page left him without an answer.
This simple "visual puzzle" was exactly the kind of problem that Perona and his former student, Serge Belongie (BS '95), then a professor at UC San Diego, had been thinking about a lot. These puzzles pop up frequently in everyday life: You see a mole on your skin and wonder if it is dangerous. You spot Asian script on a sign but are uncertain which language it is. You think, "That is a beautiful flower," but have no way of finding its name.
"We become habituated to the fact that there are so many visual questions we cannot answer," says Perona, who is also the director of the Information Science and Technology initiative at Caltech. "We become almost blind to the things that we don't understand."
Perona and Belongie wanted to make it easier to decode visual information. Wikipedia had shown that quality information could be collected and retrieved easily by searching with a good keyword. Perona and Belongie wondered if they could replace search words with images, but they knew that it would not be easy to convince people to tediously and appropriately tag every object and its parts in millions of pictures.
They envisioned a visual encyclopedia, a digital tool they dubbed Visipedia, that would use computer vision-the set of tools that computers use to "see" images in order to work with them as humans do-and machine learning to help people solve all the little visual puzzles in their lives. Ideally, they thought, you should be able to use the camera on your phone to snap a picture, tap on the desired part of the image, and quickly get directed to relevant information.
In 2010, Perona published the first paper about Visipedia, describing a system that would pair machine learning with human annotators and subject-matter experts, making pictures "first-class citizens alongside text." The system would look to humans and automation to reciprocally reinforce and enhance each other's work, speeding up the gathering and sorting of visual information, and the connection of images with text, all while increasing the accuracy of identification.
In the years that followed, through a lot of mathematical and computational work by Perona's vision group at Caltech and Belongie's group, which had moved to Cornell University, their idea blossomed. The algorithms and methodologies they developed, working with UC San Diego undergraduate Grant Van Horn, underpin two hugely successful ecological apps used by millions of people: the Cornell Lab of Ornithology's Merlin Bird ID app and iNaturalist, an animal and plant species identification app used around the world. In the process, they drummed up the interest of a whole range of other expert groups that could benefit from the ability to search and identify visual information-from art historians and florists to pathologists and fashionistas. As a result, an entire subfield called fine-grained (meaning highly specific) visual categorization was born.
Last month, Perona and Belongie, now a professor at the University of Copenhagen, were honored with Stibitz-Wilson Awards by the American Computer and Robotics Museum (ACRM) in Bozeman, Montana, for their work on Visipedia, Merlin, and iNaturalist. The awards, named in honor of inventor George Stibitz and biologist Edward O. Wilson, "honor past and present innovators in the computational and biological sciences," according to the ACRM's website.
For Perona, it was a special recognition. "When you work in academia, you work in obscurity for years to understand something, and then you publish it. If your colleagues notice it and tell you that it was interesting and that they use it in their work, that's nice. Now, when the recognition comes from completely outside, then you feel like you've gone one step beyond," he says. "Eventually, we hope to impact society at large."
The Early Days
To trace the history of the project, it is best to reach back to the 1990s, when Perona was a new assistant professor at Caltech (he earned his PhD from UC Berkeley in 1990 and joined the Caltech faculty in 1991). His field of computer vision was mainly concerned with geometry-specifically, with trying to teach computers to recover the 3D structure of the world from images. Perona had a different interest. He thought it was time for computers to try to categorize scenes and objects from images. He and his students decided to develop algorithms that could recognize faces, cars, or any number of objects rather than geometrical shapes.
"I realized that the way to approach this problem was not through analysis, mathematics, and geometry. It was through machine learning," Perona says. The basic idea was to train a computer system to correctly identify objects-faces, bicycles, buildings, or what have you-by showing it many, many images of those things, enabling it to develop a statistical model to help it identify or perceive those objects.
At the time, Perona says, the concept of visual categorization was new. No one knew exactly how to define the task. In an attempt to explain his intuition, Perona decided that he and his students should collect sets of images and annotate them with tags indicating what appeared, and where, within those images. "Once you have those annotated images, you can tell if an algorithm is able to categorize because it will assign labels to these images in the same way that a human would," he explains.
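In spirit, the setup Perona describes, learning a statistical model from many labeled examples and then checking it against held-out human annotations, looks something like the toy sketch below. It is only an illustration: synthetic feature vectors stand in for real images, the category names are borrowed from the story, and an off-the-shelf classifier stands in for the group's actual algorithms.

```python
# Toy illustration of supervised visual categorization: synthetic feature
# vectors stand in for images, labels stand in for human annotations, and a
# simple classifier learns a statistical model from the labeled examples.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
categories = ["face", "motorcycle", "leaf"]

# Fabricate 200 "images" per category as noisy feature vectors around a
# category-specific center (real systems extract features from pixels).
centers = rng.normal(size=(len(categories), 64))
X = np.vstack([c + 0.8 * rng.normal(size=(200, 64)) for c in centers])
y = np.repeat(np.arange(len(categories)), 200)

# Hold out some annotated examples to check whether the model assigns labels
# the way a human annotator would.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```

The held-out accuracy plays the same role that the annotated Caltech images did: a fixed yardstick against which any categorization algorithm can be measured.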
One of Perona's graduate students at the time, Fei-Fei Li (PhD '05), started off by building an annotated image dataset containing seven categories-airplanes, human faces, cars, spotted cats, motorcycles, bicycles, and tree leaves. Gathering dozens of appropriate images for each category and annotating each one was no trivial task. Eventually, though, they needed to expand the number of categories to ensure that the solutions they were developing for categorizing images were truly general and not merely specialized to their seven categories. Li suggested 10 categories. Perona argued for 100. In 2003, Li delivered 101, and that is how what is known as the Caltech 101 dataset came to be. It included more than 9,000 images-between 40 and 500 per category-all of which were downloaded and annotated by Caltech undergrads.
"Caltech 101 made a huge impact on the computer-vision community," Perona says. "All of a sudden, there was a clear definition of the visual categorization task, and this sparked competition for the field." The Caltech group published a paper describing its first categorization algorithm, which could achieve 18 percent accuracy with the Caltech 101 dataset. That really got the ball rolling. "Many people thought they could do better, so lots of people started working on visual categorization," Perona says. "As a result, the field flipped from geometry to visual categorization in a year or two. It was amazing."
A few years later, Li, now a professor at Stanford University, went on to create ImageNet, the most famous dataset in computer vision, which now contains more than 20,000 categories and millions of images. She also launched the ImageNet Large Scale Visual Recognition Challenge in 2010. AlexNet, a convolutional neural network architecture developed by Geoffrey Hinton of the University of Toronto and two of his students, famously won that competition in 2012, showing a glimpse of what was possible with neural networks trained with graphics processing units (GPUs). This effectively launched the so-called deep-learning revolution.
For their part, Perona and Belongie had some nagging questions on their minds: What could computer vision achieve with machine-learning algorithms and increasingly large datasets? What was the algorithms' ultimate purpose? After all, people are already capable of looking at images and recognizing categories such as faces, bicycles, or leaves for themselves. So, who needed visual categorization?
"As AI was getting more powerful and these huge datasets were taking shape, a trend was established where humans were working for the AI, not the other way around," Belongie says. "You start to wonder, who's this for? Pietro and I really connected because we were both interested in this question. We realized that we have colleagues outside of machine learning that actually do want things identified."
In 2009, at a workshop in Banff, Alberta, where computer-vision researchers met to discuss what was next for the field, Perona and Belongie decided they wanted to develop something along the lines of Visipedia that would focus on sorting images into fine-grained or subordinate categories. ImageNet contained diverse object categories: dogs, container ships, pianos, and so on. They wondered if their algorithms could still correctly classify images when the categories were much more similar. Instead of a bicycle and a spotted cat, could the algorithms discriminate between, say, a mountain bike and a road bike, or a blue jay and a scrub jay?
"We thought we must find a community of people who have this need, who have already tried to organize themselves, and all that is missing is the right technology," Perona explains. They considered dozens of possible image categories: parts of locomotives, cuts of beef, birds, fashion items, postage stamps. "The unspoken criterion was every one of these choices needed to be backed by a community of human experts," Belongie says.
An Ideal Community
They settled on birds. Importantly, birders are a serious community of experts and hobbyists who take a lot of pictures, are fastidious about classifying bird species correctly, and are generally eager to share their knowledge. Plus, this community was already in the habit of communicating with the Cornell Lab of Ornithology (Lab of O), the known hub of birding and bird research in North America. Belongie had a historical connection to the Lab of O. His older sister had worked there, and from visiting it as a child, he considered it to be a "magical place." He knew it would be an ideal organization to work with, but he also knew that they needed to have something to present to them.
In 2011, Perona, Belongie, and their students put together their first bird dataset, Caltech-UCSD Birds 200 (CUB-200), and created an algorithm to try to identify those birds. "We really just threw it together," Belongie recalls. "We didn't know that much about birds yet, but we wanted to get the attention of the Lab of O." The algorithm's performance improved steadily.
And then deep learning hit.
For some time, CUB-200 was one of the most widely used datasets among deep-learning researchers because, as Belongie says, "It had this cachet of being real." While CUB-200 was designed to address a real problem, Belongie and Perona had never expected it to attract much notice. They had selected the 200 bird species almost arbitrarily based on images they could gather easily. But because it was released right before deep learning exploded onto the scene, it got a lot of attention. (Deep learning is a type of machine learning that uses multiple layers of neural networks to extract progressively more meaningful representations from data-popular large language models like ChatGPT are based on deep learning.) The Lab of O started hearing about the bird dataset. "The Lab of O finally took a look and were appalled at how bad the dataset was," Belongie says. "But they liked what we were trying to do and invited us to meet with them."
Two researchers who had been part of Belongie's group traveled to Ithaca to meet with the Lab of O: Ryan Farrell (then a postdoctoral scholar at UC Berkeley) and undergraduate Grant Van Horn. "Cornell was basically like, 'It's a cool dataset, but it makes no sense. The species that you chose, while maybe posing an interesting computer-vision problem, aren't exactly the species that one would use if you actually wanted to make a useful bird identification service, especially for North America,'" says Van Horn, who is now a researcher at the Lab of O. "So, we partnered with them to redo that whole process."
They engaged with the Lab of O's experts to painstakingly annotate bird images and to complete a new dataset that would be useful for birders. They also had the Lab of O's extensive eBird database of sightings documented by the birding community to work with. "Grant was able single handedly to develop a deep network that was trainable on the whole dataset that Cornell had," Perona says. The outcome of that project was the photo ID component of the Merlin Bird ID app that was first released to the public in 2017. Merlin users can snap a photo of a bird, upload it to the app, answer a series of questions about their observations, and get a list of likely species to choose from. When they see the right one, they click "This is My Bird," and their observation becomes data that helps improve the algorithm's performance.
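The loop described above can be summarized abstractly: the model ranks candidate species for an uploaded photo, the user confirms the right one, and the confirmed label flows back as new training data. The snippet below is only a schematic sketch of that pattern; the function names, scores, and file names are hypothetical stand-ins, not Merlin's actual code or API.

```python
# Schematic sketch of the human-in-the-loop feedback pattern; all names and
# values here are hypothetical, not Merlin's actual code or API.
def rank_candidates(species_scores):
    """Sort species by the model's score for one uploaded photo."""
    return sorted(species_scores, key=species_scores.get, reverse=True)

# Pretend scores for one photo; in the real app a trained network produces
# these from the image together with the user's answers and location.
scores = {"Blue Jay": 0.61, "Steller's Jay": 0.22, "California Scrub-Jay": 0.12}
candidates = rank_candidates(scores)
print("Likely species:", candidates)

# When the user confirms "This is My Bird," the photo and its confirmed label
# join a pool of new labeled examples for future rounds of training.
new_training_examples = []
confirmed_species = candidates[0]          # the user picks the correct match
new_training_examples.append(("photo_123.jpg", confirmed_species))
print("New labeled example:", new_training_examples[-1])
```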
Branching Out
A second application of Visipedia also came about through the Lab of O connection. In 2016, the lab hosted an event focused on citizen science and technology and invited Visipedia researchers to attend. There, Van Horn met Scott Loarie from the California Academy of Sciences, one of the directors of an app project called iNaturalist. The iNaturalist team wanted Visipedia to do for all animal and plant species what it had done for birds through the Merlin app. By that time, Van Horn was a grad student at Caltech (PhD '19), and he embedded himself with iNaturalist. Within six months, he had developed an algorithm trained on all the plant and animal species documented in the app. Thanks to contributions by millions of users, it now includes more than 100,000 species.
"Though Serge and Pietro never necessarily wanted Visipedia to be focused on biodiversity, from my perspective, it was our partnerships with these biodiversity groups that really let us explore some fun ideas and gave us the resources-both in terms of experts and data-that allowed us to make so much progress," Van Horn says.
Both the Merlin and iNaturalist apps have been widely adopted by experts and hobbyists. "What we saw was that the availability of a computer-vision system that could recognize and classify species of these birds, plants, and animals was indeed making these apps much more successful," Perona says. Together, the apps have now been downloaded more than 30 million times.
With additional users comes additional data, so the algorithms are continually improving. The team has also started examining other ways in which the data could be useful. For example, Perona says, once they had collected enough informal iNaturalist observations, they wondered if it was possible to use this data to estimate distributions of species around the world over time. "Looking at correlations of species tells us something about which species tend to be found together. By exploiting all of these correlations from 100,000 different species worldwide, you can fill in a lot of the gaps," Perona says. More recently, postdoctoral scholar Oisin Mac Aodha, now an associate professor at the University of Edinburgh, and student Eli Cole (PhD '23) started sharing maps with the iNaturalist community, predicting the distribution of species from fairly sparse observations. "As it turns out, these maps are very accurate," Perona says.
The work with iNaturalist and Merlin is ongoing. Van Horn is currently trying to accomplish for audio files on Merlin what they have been able to do with images. And iNaturalist would like to incorporate large-language models like those used in ChatGPT to assist when there are disagreements over classifications in the comments section of iNaturalist.
Looking back, Visipedia's legacy lies not only in these broadly used apps or in the algorithms at their core. Visipedia has also influenced the development of the field of fine-grained visual categorization (FGVC), which assists any community whose experts aim to drill down and identify objects in a more detailed way. Belongie says that when they held the first FGVC workshop at a computer-vision conference in 2011, they thought it would be a one-time event. But a growing community showed interest in the topic, and the 12th workshop took place last year.
"Every year, new groups come out of the woodwork to attend the workshop," Belongie says. "They were the ones who felt burned by the deep-learning revolution. They kept reading these impressive news stories about deep learning solving all these problems, but the methods just didn't work for them." They needed more specific identification-broad categories were of no use to them.
In the race to have the biggest models and the most impressive datasets, the standard methods for deep learning would end up classifying, for example, 100 different butterfly species and 25 moth species as simply "butterfly." But to a butterfly expert or someone whose hobby is observing and identifying butterflies, the small differences matter. "We came to realize while working on Visipedia that those communities that need fine-grained categorization are everywhere, and our algorithms were up to the task."
The Visipedia project has been supported by a Google Focused Research Award and by the Office of Naval Research, the National Science Foundation, the Joan and Irwin Jacobs Technion-Cornell Institute, and Caltech's Resnick Sustainability Institute.