Gkioxari Co-Leads Advanced AI 3D Perception Model

Snap a picture of a busy or complex scene: perhaps a crosswalk in Manhattan or a cluttered family room. Now imagine being able to click on any object in that scene, no matter how occluded or small it might be, and reconstruct that object in three dimensions. Meta Superintelligence Labs recently released a tool called SAM 3D that is capable of just that. Driven by machine learning, SAM 3D marks a clear advance in 3D computer perception, with immediate applications in fields as diverse as biology, gaming, retail, and security. Meta released it as open source, making it freely available for all to use.

Georgia Gkioxari, assistant professor of computing and mathematical sciences and electrical engineering at Caltech, was a lead on the project, and one of her graduate students, Ziqi Ma, is among the 25 authors of the paper describing the accomplishment.

We sat down with Gkioxari, who is also a William H. Hurt Scholar, to talk about 3D computer perception and how SAM 3D fits in with her work in computer vision, the field devoted to enabling computers to extract and make use of the information contained in visual data.

Can you please give us a basic overview of SAM 3D?

The project is actually two models: SAM 3D Objects and SAM 3D Body, a version for human bodies and shapes. SAM stands for Segment Anything Model, which is the original computer vision model Meta released in 2023 to help identify and isolate objects in images. SAM 3D is the Segment Anything Model for 3D. It takes any real-world image, and for any object seen in the image, it can lift that object out and reconstruct its 3D shape, and it can also tell you how far the object is from the camera.
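[A rough sketch of the input/output contract Gkioxari describes: an image and a selected object go in, and a 3D shape positioned relative to the camera comes out. The names below, such as ObjectReconstruction and lift_object, are illustrative assumptions, not the actual SAM 3D API.]

```python
# Illustrative sketch only: a way to picture the input/output described above.
# These names are assumptions, not the actual SAM 3D interface.

from dataclasses import dataclass

import numpy as np


@dataclass
class ObjectReconstruction:
    vertices: np.ndarray     # (V, 3) mesh vertices in the camera frame, in meters
    faces: np.ndarray        # (F, 3) triangle indices into `vertices`
    translation: np.ndarray  # (3,) object center relative to the camera

    @property
    def distance_to_camera(self) -> float:
        # Straight-line distance from the camera origin to the object center.
        return float(np.linalg.norm(self.translation))


def lift_object(image: np.ndarray, object_mask: np.ndarray) -> ObjectReconstruction:
    """Hypothetical stand-in for a single-image 3D reconstruction call:
    image plus selected object in, posed 3D shape out."""
    raise NotImplementedError("Placeholder; see the released SAM 3D code and demo.")
```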

How does that fit in with the overall scope of your research?

To me this is a very personal project. I've been working on 3D for years, and, intellectually, it very much follows my work and what I have done. But as with most AI work now, it truly was a big team effort.

If you think of the world, of course it is three-dimensional. But somehow the data that dominate our digital world are either one-dimensional, which is text, or two-dimensional, which are images. So, the third dimension of the world is lost.

This is actually fine for many things because the human brain fills in the gap. This is why we're carried away by movies and why storytelling is still very immersive: your brain is doing a lot of the work. The problem is that our machines don't have that. So, to enable real-world applications, especially in robotics, augmented reality, or anything where you might want to create a digital environment, we want machines to have this ability to build 3D worlds so they can operate better within the real world.

Imagine a robot that needs to come into your house and do a lot of chores in a cluttered environment. You're asking it to go clean up the kitchen table. It needs to do a lot of calculations in 3D to do that, but robots only ever see the world through 2D images. The challenge was: How are we going to overcome the fact that our inputs are two-dimensional images, but actions and outputs need to be grounded in the 3D world? This is where lifting to 3D came in and why it is important.

What's new and innovative about this project?

It's an AI innovation in the sense that, yes, it's a large, data-driven model. But the biggest hurdle was obtaining the data to train the model. 3D data are not readily available, as opposed to text or images. So, we had to be a little bit more creative. The innovation in this work was using human annotators, but not to design in 3D; for that you would need graphic designers, who are very expensive and not always available to annotate images. We had a model-in-the-loop data engine where models propose solutions [here, 3D shapes] and then human reviewers just select the best ones. This doesn't require expertise; it just requires common sense. If a proposed shape is good, obeys the input image, and is complete, annotators select it. Then that data feeds back into the model, and the model feeds back into the data engine, and it's a loop that keeps going. Slowly but steadily, the model becomes better and annotates more examples, covering more of that distribution of the 3D world.
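[A minimal sketch of the loop Gkioxari describes, assuming a generic reconstruction model with propose and retrain steps; the function and attribute names here are illustrative, not the actual SAM 3D data engine.]

```python
# Minimal sketch of a model-in-the-loop data engine in the spirit described
# above. The model interface (propose_candidates, retrain) and helper names
# are assumptions for illustration, not the actual SAM 3D pipeline.

from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Example:
    image_id: str                                            # reference to a 2D image
    candidates: List[object] = field(default_factory=list)   # proposed 3D shapes
    accepted: Optional[object] = None                         # the reviewer's pick, if any


def data_engine_round(
    model,                                           # current reconstruction model
    unlabeled: List[Example],
    review: Callable[[Example], Optional[int]],      # human picks an index, or None
    train_set: List[Tuple[str, object]],
    num_candidates: int = 4,
) -> None:
    """One iteration of the loop: propose, review, grow the training set, retrain."""
    for ex in unlabeled:
        # The model proposes several candidate 3D shapes for the object in the image.
        ex.candidates = model.propose_candidates(ex.image_id, num_candidates)
        # A reviewer with no 3D expertise selects the best candidate, if any is
        # good, obeys the input image, and is complete.
        choice = review(ex)
        if choice is not None:
            ex.accepted = ex.candidates[choice]
            train_set.append((ex.image_id, ex.accepted))
    # Accepted shapes feed back into training, so the next round's proposals are
    # better and cover more of the distribution of the 3D world.
    model.retrain(train_set)
```

[Repeating such rounds grows the labeled set without graphic designers, which is the scaling benefit described above.]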

I want to highlight that this idea could really be useful in any task where you need expertise in labeling. In biology, for example, where doctors are sometimes needed to label data, such as cancerous vs. noncancerous tissue, having this model-in-the-loop design, which we propose, could really increase the scale of data for training.

How is SAM 3D going to be used immediately?

Well, we released SAM 3D fully open source, which I think is what makes it a cool scientific contribution, because now engineers, biologists, whoever, can use it whenever they want. We also released a demo so that anyone can upload their own image and test the model out without having access to GPUs [graphics processing units]. Meta has plugged this into Facebook Marketplace, so anytime you want to see a product, you can select the image and see a 3D view. Meta also released a robotics demo where this was used to enable manipulation in robots, which I think might resonate a lot with Caltech groups.

This is truly just the beginning. We are starting to collaborate with other groups on campus to see how this can be integrated into many different areas.
