In an elderly-care themed skit during the 2026 Spring Festival Gala (Chunwan), a lifelike android was modeled on actress CAI Ming. Why are humanoid robots becoming so lifelike, almost indistinguishable from real humans? One key technology enabling virtual humans to express vivid emotions, recognize identities, and demonstrate embodied intelligence is three-dimensional (3D) facial landmark (keypoint) detection.
However, because large-scale, accurately annotated 3D facial datasets are scarce, most current 3D facial landmark detection algorithms rely on two-dimensional (2D) texture assistance or synthetic digital 3D faces. Their performance is therefore limited by the accuracy of texture mapping and by the gap between digital faces and real human faces.
In a study published in IEEE Transactions on Circuits and Systems for Video Technology, a research team led by Prof. SONG Zhan from the Shenzhen Institutes of Advanced Technology of the Chinese Academy of Sciences, together with Dr. YE Yuping from Fujian University of Technology, developed a new curvature-fused graph attention network (CF-GAT) capable of predicting facial landmarks directly from raw point clouds.
The researchers developed a custom 3D/4D facial acquisition system and conducted standardized data collection. The resulting database included approximately 200,000 high-fidelity 3D facial scans, complemented by a multi-expression 3D face dataset, a standardized 3D facial landmark dataset, a high-precision 3D human body dataset, and a dynamic 4D facial expression dataset. This multimodal biometric dataset was selected for Fujian Province's 2025 High-Quality AI Dataset Program.
The researchers also designed CF-GAT to operate directly on unordered point clouds. A geometry-driven sampling strategy extracts a simplified point set while preserving essential curvature information. This curvature is then encoded as an explicit geometric prior and integrated into the attention mechanism, allowing the network to capture subtle local shape variations. Through a graph attention structure that models both local and global relationships among points, the network predicts 3D landmark coordinates without relying on 2D textures or template models.
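The two-stage idea described above, curvature-preserving sampling followed by curvature-biased graph attention, can be sketched roughly as follows. This is a minimal, illustrative NumPy sketch under our own assumptions (a kNN neighbourhood graph, dot-product attention, and a simple curvature-difference bias); the function names and fusion rule are hypothetical, not the authors' implementation.

```python
import numpy as np

def curvature_sampling(points, curvature, m):
    """Return indices of the m points with the largest curvature magnitude
    (a hypothetical stand-in for the paper's geometry-driven sampling)."""
    return np.argsort(-np.abs(curvature))[:m]

def knn_indices(points, k):
    """Indices of each point's k nearest neighbours (excluding itself)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, 1:k + 1]

def curvature_fused_attention(points, curvature, feats, k=4):
    """One graph-attention aggregation step with curvature as a geometric prior.

    points:    (N, 3) point coordinates
    curvature: (N,)   per-point curvature estimates
    feats:     (N, F) input point features
    Returns (N, F): attention-weighted neighbourhood feature averages.
    """
    n = len(points)
    nbrs = knn_indices(points, k)                     # (N, k) neighbour graph
    rows = np.arange(n)[:, None]
    # Attention logits: feature similarity plus a bias favouring neighbours
    # with similar curvature (our assumed curvature-fusion rule).
    sim = feats @ feats.T
    curv_bias = -np.abs(curvature[:, None] - curvature[None, :])
    logits = sim[rows, nbrs] + curv_bias[rows, nbrs]  # (N, k)
    # Softmax over each point's neighbours, then aggregate their features.
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return (w[..., None] * feats[nbrs]).sum(axis=1)
```

In a full network such a layer would be stacked, with learned feature projections, so that the receptive field grows from local neighbourhoods toward global face structure before a final head regresses the 3D landmark coordinates.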
Experiments showed that the proposed network achieved higher robustness to noise, stronger generalization across diverse facial shapes, and more accurate localization of fine‑grained landmarks. The work demonstrates how large‑scale, high‑quality datasets can reshape algorithmic performance, enabling models to learn richer geometric patterns and adapt more effectively to real‑world variability.