Vision Transformers Mimic Human Gaze Precisely

A research team led by The University of Osaka has demonstrated that vision transformers using self-attention mechanisms can spontaneously develop visual attention patterns similar to those of humans, without being specifically trained to do so.

Can machines ever see the world as we see it? Researchers have uncovered compelling evidence that vision transformers (ViTs), a type of deep-learning model that specializes in image analysis, can spontaneously develop human-like visual attention patterns when trained without labeled data.

Visual attention is the mechanism by which organisms, or artificial intelligence (AI), filter out 'visual noise' to focus on the most relevant parts of an image or view. While this comes naturally to humans, acquiring it spontaneously has proven difficult for AI. However, the researchers have shown, in a recent publication in Neural Networks, that with the right training experience, AI can spontaneously acquire human-like visual attention without being explicitly taught to do so.

The research team, from The University of Osaka, compared human eye-tracking data to attention patterns generated by ViTs trained using DINO ('self-distillation with no labels'), a method of self-supervised learning that allows models to organize visual information without annotated datasets. Remarkably, the DINO-trained ViTs exhibited gaze behavior that closely mirrored that of typically developing adults when viewing dynamic video clips. In contrast, ViTs trained with conventional supervised learning showed unnatural visual attention.
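For readers who want to experiment, DINO-pretrained ViTs are publicly released through the facebookresearch/dino repository and can be loaded via torch.hub. The minimal Python sketch below shows how to read out a model's self-attention maps for a frame; it illustrates the general approach, not the authors' exact pipeline, and the input here is a random stand-in for a preprocessed video frame.

```python
# A minimal sketch (assumes the public facebookresearch/dino release,
# not the study's own code): load a DINO-pretrained ViT-S/16 and read
# out its last-layer self-attention.
import torch

model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

frame = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed video frame

with torch.no_grad():
    # get_last_selfattention() is provided by the DINO repo's ViT class.
    # Shape: (batch, heads, tokens, tokens); tokens = 1 [CLS] + 14x14 patches.
    attn = model.get_last_selfattention(frame)

print(attn.shape)  # torch.Size([1, 6, 197, 197]) for ViT-S/16
```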

"Our models didn't just attend to visual scenes randomly, they spontaneously developed specialized functions," says Takuto Yamamoto, lead author of the study. "One subset of the model consistently focused on faces, another captured the outlines of entire figures, and a third attended primarily to background features. This closely reflects how human visual systems segment and interpret scenes."

Through detailed analyses, the team demonstrated that these attention clusters emerged naturally in the DINO-trained ViTs. The attention patterns were not only qualitatively similar to human gaze but also quantitatively aligned with established eye-tracking data, particularly in scenes involving human figures. The findings suggest that the traditional two-part figure-ground model of perception in psychology could be extended into a three-part model.

"What makes this result remarkable is that these models were never told what a face is," explains senior author, Shigeru Kitazawa, "Yet they learned to prioritize faces, probably because doing so maximized the information gained from their environment. It is a compelling demonstration that self-supervised learning may capture something fundamental about how intelligent systems, including humans, learn from the world."

The study underscores the potential of self-supervised learning not only for advancing AI applications but also for modeling aspects of biological vision. By aligning artificial systems more closely with human perception, self-supervised ViTs offer a new lens for interpreting both machine learning and human cognition. The findings could be applied in a variety of ways, from the development of human-friendly robots to tools that support early childhood development.


Fig. 1 Comparison of gaze coordinates between human participants and attention heads of vision transformers (ViTs)

Video clips from N2010 (Nakano et al., 2010) and CW2019 (Costela and Woods, 2019) were presented to ViTs. The gaze position of each self-attention head of the class token ([CLS]), identified as the peak position within its self-attention map over the patch tokens, was compared with human gaze positions from the respective datasets. Six ViT models with varying numbers of layers (L = 4, 8, or 12) were tested, trained either by supervised learning (SL) or by self-supervised learning using the DINO method.
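In code, that per-head "gaze" readout amounts to taking the argmax of each head's [CLS]-to-patch attention and mapping the winning patch back to pixel coordinates. The sketch below follows the caption's description under assumed shapes (a 14x14 patch grid of 16-pixel patches); the function name and interface are illustrative, not taken from the paper.

```python
# Sketch of the caption's peak-finding step, with assumed shapes;
# not the authors' code.
import torch

def head_gaze_positions(attn: torch.Tensor, grid: int = 14, patch: int = 16):
    """attn: (heads, tokens, tokens) self-attention for one frame.
    Returns (heads, 2) pixel coordinates (x, y) of each head's peak."""
    cls_to_patch = attn[:, 0, 1:]          # (heads, grid*grid): [CLS] -> patches
    peak = cls_to_patch.argmax(dim=-1)     # flat index of each head's peak patch
    row, col = peak // grid, peak % grid   # back to grid coordinates
    x = col.float() * patch + patch / 2    # centre of the winning patch, pixels
    y = row.float() * patch + patch / 2
    return torch.stack([x, y], dim=-1)
```

Applied to the attention tensor from the earlier sketch, head_gaze_positions(attn[0]) would yield one (x, y) peak per head, ready to compare against eye-tracking coordinates.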


Fig. 2 Attention of DINO-trained ViTs closely resembles that of humans

Top: In a scene depicting a conversation between two children, the human gaze is predominantly directed toward the face of the child on the right (left). A ViT trained with the DINO method focuses on the face of the child on the right (center). In contrast, the gaze of a ViT trained with supervised learning (SL) is scattered (right).

Bottom: The distance between each attention head and human gaze was quantified layer by layer. In the DINO-trained ViT, heads that exhibited attention patterns similar to human gaze emerged in layers 9 and 10.
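Once each head's peak is known, the layer-by-layer comparison in Fig. 2 (bottom) reduces to a simple distance computation. Here is a hedged sketch, assuming hypothetical containers (a list of per-layer peak tensors and a single human gaze point for the same frame):

```python
# Sketch only: names and data layout are assumptions, not the paper's code.
import torch

def layer_gaze_distances(peaks_per_layer, gaze):
    """peaks_per_layer: list of (heads, 2) tensors, one entry per layer.
    gaze: (2,) tensor with the human gaze position for the same frame.
    Returns a list of (heads,) distance tensors, one per layer."""
    return [torch.linalg.norm(p - gaze, dim=-1) for p in peaks_per_layer]
```

Averaging these distances over many frames, layer by layer, would yield the kind of per-layer profile shown in the figure, where DINO-trained heads in layers 9 and 10 sit closest to human gaze.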


Fig. 3 Attention heads in DINO-trained ViTs were grouped into three categories

In the DINO-trained ViT12 model, 144 attention heads from layers exhibiting human-like attention were classified using multidimensional scaling based on attention-to-gaze distances across many images. The heads were classified into three distinct groups: G1 focused on the center of figures (e.g., faces), G2 focused on figure outlines (e.g., whole bodies), and G3 focused on the ground (background).
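The grouping itself can be sketched with standard tools: embed the heads with multidimensional scaling on their pairwise dissimilarities, then cluster the embedding. The caption specifies only MDS over attention-to-gaze distances; the feature matrix, the k-means step, and all names below are illustrative assumptions rather than the paper's procedure.

```python
# Illustrative sketch: MDS + k-means on a stand-in feature matrix
# (rows = heads, columns = per-image attention-to-gaze distances).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
features = rng.random((144, 500))   # 144 heads x 500 images (fake data)

# Pairwise dissimilarity between heads, then a 2-D MDS embedding.
dissim = squareform(pdist(features, metric='euclidean'))
coords = MDS(n_components=2, dissimilarity='precomputed',
             random_state=0).fit_transform(dissim)

# Three clusters, echoing the G1/G2/G3 grouping reported in the paper.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)
```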

The article "Emergence of Human-Like Attention and Distinct Head Clusters in Self-Supervised Vision Transformers: A Comparative Eye-Tracking Study" has been published in Neural Networks at DOI: https://doi.org/10.1016/j.neunet.2025.107595
