When it comes to navigating their surroundings, machines have a natural disadvantage compared to humans. To help hone the visual perception abilities they need to understand the world, researchers have developed a novel training dataset for improving spatial awareness in robots.
In the new study, experiments showed that robots trained with this dataset, called RoboSpatial, outperformed those trained on baseline models at the same robotic tasks, demonstrating a richer understanding of both spatial relationships and physical object manipulation.
For humans, visual perception shapes how we interact with the environment, from recognizing different people to maintaining an awareness of our body's movements and position. Previous attempts to imbue robots with these skills have largely fallen short, as most models are trained on data that lacks sophisticated spatial understanding.
Because deep spatial comprehension is necessary for intuitive interaction, these spatial reasoning challenges, if left unaddressed, could hinder future AI systems' ability to comprehend complex instructions and operate in dynamic environments, said Luke Song, lead author of the study and a PhD student in engineering at The Ohio State University.

"To have true general-purpose foundation models, a robot needs to understand the 3D world around it," he said. "So spatial understanding is one of the most crucial capabilities for it."
The study was recently given as an oral presentation at the Conference on Computer Vision and Pattern Recognition.
To teach robots how to better interpret perspective, RoboSpatial includes more than a million real-world indoor and tabletop images, thousands of detailed 3D scans, and 3 million labels describing rich spatial information relevant to robotics. Using these vast resources, the framework pairs 2D egocentric images with full 3D scans of the same scene so the model learns to pinpoint objects using either flat-image recognition or 3D geometry.
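The paper defines the exact annotation schema; purely as an illustration of the pairing described above, a single training example that links a 2D egocentric image, a 3D scan of the same scene, and a spatial label might be represented like the hypothetical record below (all field names and paths are assumptions, not taken from RoboSpatial).

```python
from dataclasses import dataclass


@dataclass
class SpatialExample:
    """Hypothetical sketch of one paired 2D/3D training example.

    Field names are illustrative assumptions, not the actual
    RoboSpatial schema.
    """
    image_path: str   # 2D egocentric RGB image of the scene
    scan_path: str    # full 3D scan (e.g., a point cloud) of the same scene
    question: str     # natural-language spatial question about the scene
    answer: str       # ground-truth answer label


# Example instance (paths and labels are made up for illustration):
example = SpatialExample(
    image_path="scenes/kitchen_042/rgb.jpg",
    scan_path="scenes/kitchen_042/scan.ply",
    question="Is the bowl to the left of the mug?",
    answer="yes",
)
```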
According to the study, this process closely mimics how visual cues function in the real world.
For instance, while current training datasets might allow a robot to accurately describe a "bowl on the table," the model would lack the ability to discern where on the table the bowl actually sits, where it should be placed to remain accessible, or how it fits in with other objects. In contrast, models trained with RoboSpatial were rigorously tested on these spatial reasoning skills in practical robotic tasks, first by demonstrating object rearrangement and then by examining their capacity to generalize to new spatial reasoning scenarios beyond the original training data.
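To make the kind of geometric reasoning involved concrete, the sketch below checks whether a hypothetical object footprint fits at a candidate spot on a table without overlapping other objects. The axis-aligned 2D simplification and all names are assumptions for illustration, not the method used in the RoboSpatial experiments.

```python
from typing import List, Tuple

# A footprint is an axis-aligned box on the table plane: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def overlaps(a: Box, b: Box) -> bool:
    """True if two axis-aligned footprints intersect on the table plane."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])


def can_place(candidate: Box, table: Box, others: List[Box]) -> bool:
    """Toy spatial-compatibility check: the candidate footprint must lie
    inside the table and avoid every existing object's footprint."""
    inside = (candidate[0] >= table[0] and candidate[1] >= table[1]
              and candidate[2] <= table[2] and candidate[3] <= table[3])
    return inside and not any(overlaps(candidate, o) for o in others)


# Toy usage: try to place a bowl next to a mug already on the table.
table = (0.0, 0.0, 1.0, 0.6)
mug = (0.40, 0.20, 0.50, 0.30)
bowl_spot = (0.55, 0.15, 0.75, 0.35)
print(can_place(bowl_spot, table, [mug]))  # True: the spot is free
```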
"Not only does this mean improvements on individual actions like picking up and placing things, but also leads to robots interacting more naturally with humans," said Song.
One of the systems the team tested this framework on was a Kinova Jaco robot, an assistive arm that helps people with disabilities connect with their environment.
During training, it was able to correctly answer simple closed-ended spatial questions such as "can the chair be placed in front of the table?" or "is the mug to the left of the laptop?"
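As a rough sketch of how such closed-ended questions can be scored, the loop below compares a model's yes/no answers against ground truth. The answer_spatial_question function is a hypothetical stand-in for whatever trained model is being evaluated, not the study's actual evaluation code, and the example questions echo those quoted above.

```python
def answer_spatial_question(image_path: str, question: str) -> str:
    """Placeholder: a real system would run a trained vision-language model here."""
    return "yes"


# Tiny hypothetical evaluation set of (image, question, ground-truth answer) triples.
eval_set = [
    ("scenes/office_007/rgb.jpg", "Can the chair be placed in front of the table?", "yes"),
    ("scenes/office_007/rgb.jpg", "Is the mug to the left of the laptop?", "no"),
]

# Count how many closed-ended answers match the ground truth.
correct = sum(
    answer_spatial_question(img, q).strip().lower() == gold
    for img, q, gold in eval_set
)
print(f"accuracy: {correct / len(eval_set):.2f}")
```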
These promising results suggest that building spatial context into robotic perception could lead to safer and more reliable AI systems, said Song.
While many questions about AI development and training remain open, the work concludes that RoboSpatial has the potential to serve as a foundation for broader applications in robotics, noting that further advances in spatial reasoning will likely branch from it.
"I think we will see a lot of big improvements and cool capabilities for robots in the next five to ten years," said Song.
Co-authors include Yu Su from Ohio State and Valts Blukis, Jonathan Tremblay, Stephen Tyree and Stan Birchfield from NVIDIA. This work was supported by the Ohio Supercomputer Center.