Affective computing, proposed by Picard in 1997, aims to endow computational systems with the ability to recognize, interpret, and respond to human emotions. Early studies relied primarily on behavioral cues such as facial expressions and voice tone for modelling affective states.
Affective computing has since entered a new phase: wearable devices can now continuously acquire multimodal physiological signals from multiple sensor channels that differ in sampling frequency, physiological origin, and signal characteristics.
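To make this alignment challenge concrete, here is a minimal sketch (our own illustration, not taken from the review) of how channels recorded at different rates might be resampled onto a common grid. The 60-second duration, the per-channel rates, and the 4 Hz target are assumptions chosen to resemble a typical wearable recording.

```python
import numpy as np

def resample_to_common_rate(signal, orig_rate, target_rate, duration_s):
    """Linearly interpolate a 1-D signal onto a shared sampling grid."""
    t_orig = np.arange(len(signal)) / orig_rate
    t_target = np.arange(0, duration_s, 1.0 / target_rate)
    return np.interp(t_target, t_orig, signal)

# Hypothetical 60-second recording: EDA at 4 Hz, one ACC axis at 32 Hz, HR at 1 Hz.
duration = 60
eda = np.random.randn(4 * duration)
acc = np.random.randn(32 * duration)
hr = 70 + np.random.randn(1 * duration)

target_rate = 4  # align every channel to 4 Hz
aligned = np.stack([
    resample_to_common_rate(eda, 4, target_rate, duration),
    resample_to_common_rate(acc, 32, target_rate, duration),
    resample_to_common_rate(hr, 1, target_rate, duration),
])  # shape: (3 modalities, 240 time-aligned samples)
```

Once the channels share a time base, they can be windowed and fed jointly into the fusion strategies and models discussed below.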
In a new review published in Intelligent Sports and Health, a duo of researchers from the Department of Psychological and Cognitive Sciences at Tsinghua University surveyed the current landscape of data processing pipelines, multimodal fusion strategies, and model architectures. They also summarized the development status and challenges of deep learning methods in the field of affective computing.
"Both public datasets and self-collected data play important roles in affective computing studies," shares co-author Dan Zhang. "They show high consistency in terms of modality, device, signal length, number of subjects, and label acquisition. For example, they both make extensive use of common physiological modalities such as EDA and HR and rely mainly on commercially available wearable devices (e.g., Empatica E4) for their measurements."
Notably, most labels are acquired with self-assessment tools (e.g., the SAM scales), and the number of subjects is typically on the order of tens.
"Some of the self-collected data incorporate sports-related scenarios, such as walking simulations, which record multimodal signals (including EDA, ACC, HR, etc.) to capture affective changes in individuals during physical activity," adds Zhang. "These types of data hold potential value for applications in the sports field, such as monitoring emotional fatigue during training or assessing athletes' emotional regulation capabilities under competitive pressure."
Further, multimodal fusion in affective computing can be implemented at different stages of the modelling pipeline, including feature-level, model-level, and decision-level fusion. "Feature-level fusion provides a simple and easy-to-operate mechanism, model-level fusion can capture cross-modal interactions within the network structure, and decision-level fusion allows for independent processing of each modality," says co-author Fang Li. "The selection of fusion strategies typically depends on the temporal characteristics, complementarity, and reliability of the involved modalities, as well as the complexity of the classification task."
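The following PyTorch sketch illustrates where the modalities are combined under each of the three strategies. It is a simplified example of ours, not code from the review; the feature dimensions, the three-class output, and the averaging of softmax scores at the decision level are all assumptions.

```python
import torch
import torch.nn as nn

# Toy per-modality features: a batch of 8 windows with 16 EDA features and 8 HR features.
eda_feat = torch.randn(8, 16)
hr_feat = torch.randn(8, 8)
n_classes = 3

# Feature-level fusion: concatenate raw features, then train one shared classifier.
feature_clf = nn.Linear(16 + 8, n_classes)
logits_feature = feature_clf(torch.cat([eda_feat, hr_feat], dim=1))

# Model-level fusion: modality-specific encoders whose representations interact inside the network.
eda_enc = nn.Linear(16, 32)
hr_enc = nn.Linear(8, 32)
model_head = nn.Linear(32, n_classes)
logits_model = model_head(torch.relu(eda_enc(eda_feat)) + torch.relu(hr_enc(hr_feat)))

# Decision-level fusion: fully independent classifiers, with predictions combined afterwards.
eda_clf = nn.Linear(16, n_classes)
hr_clf = nn.Linear(8, n_classes)
probs_decision = (eda_clf(eda_feat).softmax(dim=1) + hr_clf(hr_feat).softmax(dim=1)) / 2
```

The trade-off the authors describe is visible in the code: feature-level fusion needs only a concatenation, model-level fusion adds learnable cross-modal interaction, and decision-level fusion lets an unreliable channel be processed, weighted, or dropped on its own.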
The authors noted that deep learning methods can automatically extract and model feature representations of the input data. Such models include CNNs for extracting local features, LSTMs for capturing temporal dependencies, and transformers for modelling long-range temporal dependencies through a self-attention mechanism.
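As a rough sketch of how these building blocks fit together on wearable data, the example below stacks a 1-D CNN, an LSTM, and a transformer encoder over a multichannel window. The window size, channel count, layer widths, and three affective classes are illustrative assumptions rather than the architecture proposed in the review.

```python
import torch
import torch.nn as nn

# Hypothetical batch of 8 windows, 3 channels (e.g., EDA, ACC, HR), 240 samples each.
x = torch.randn(8, 3, 240)

# CNN: 1-D convolutions extract local waveform features along the time axis.
cnn = nn.Conv1d(in_channels=3, out_channels=32, kernel_size=5, padding=2)
local_feats = torch.relu(cnn(x))            # (8, 32, 240)

# LSTM: models how those local features evolve over time.
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
seq = local_feats.transpose(1, 2)           # (8, 240, 32): batch, time, features
lstm_out, _ = lstm(seq)                     # (8, 240, 64)

# Transformer encoder: self-attention relates time steps regardless of their distance.
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
attended = transformer(lstm_out)            # (8, 240, 64)

# Pool over time and classify into, say, three affective states.
classifier = nn.Linear(64, 3)
logits = classifier(attended.mean(dim=1))   # (8, 3)
```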