Human facial expression recognition is important in a variety of human-related systems, including healthcare and medicine. Thanks to the recent success of deep learning and the availability of vast amounts of annotated data, facial expression recognition research has matured enough to be used in real-world applications with audio-visual datasets. Jun-Hwa Kim, Namho Kim, and Chee Sun Won of Dongguk University in Seoul, Korea, present a Swin transformer-based facial expression approach for the Aff-Wild2 Expression dataset, an in-the-wild audio-visual dataset. To merge multi-modal information, spatial, temporal, and acoustic, into facial expression recognition, the researchers use a three-stream network for the audio-visual videos, made up of a visual stream, a temporal stream, and an audio stream. The visual stream operates on a single frame, the temporal stream on multiple frames, and the audio stream on an image obtained by converting the audio signal into a mel-spectrogram. Tested on the Aff-Wild2 dataset and its eight categories of human facial expression, the multi-modal technique proved useful.
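To make the architecture concrete, here is a minimal sketch of such a three-stream classifier in PyTorch. Everything below, the stand-in encoders, the feature size, and the input shapes, is an illustrative assumption; the paper itself uses Swin transformer backbones, which are replaced here with trivial flatten-and-project encoders.

```python
import torch
import torch.nn as nn

class ThreeStreamNet(nn.Module):
    """Sketch of a three-stream expression classifier (illustrative assumptions).

    Each stream maps its modality to a fixed-size feature vector; the three
    vectors are concatenated and classified into eight expression categories.
    """

    def __init__(self, feat_dim: int = 256, num_classes: int = 8):
        super().__init__()
        # Visual stream: one cropped face image -> feature vector.
        self.visual = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        # Temporal stream: a short stack of frames -> feature vector.
        self.temporal = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        # Audio stream: a mel-spectrogram image -> feature vector.
        self.audio = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        # Fuse the three modalities and classify.
        self.head = nn.Linear(3 * feat_dim, num_classes)

    def forward(self, frame, clip, mel):
        fused = torch.cat(
            [self.visual(frame), self.temporal(clip), self.audio(mel)], dim=1)
        return self.head(fused)  # (batch, num_classes) logits

# A forward pass takes a single frame, a frame stack, and a mel-spectrogram:
model = ThreeStreamNet()
logits = model(torch.randn(4, 3, 112, 112),      # visual: one frame per sample
               torch.randn(4, 8, 3, 112, 112),   # temporal: 8-frame stack
               torch.randn(4, 1, 128, 128))      # audio: mel-spectrogram image
```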
Recognition of human facial expressions has been a highly prominent task in recent years, not only in AI research but also in practical applications such as healthcare and medicine. The impressive advances in deep learning, together with the availability of large annotated datasets, pave the way for real-world facial expression recognition. In response to this trend, the 3rd Affective Behavior Analysis in the Wild (ABAW 2022) competition, held in conjunction with CVPR 2022, offers the large-scale in-the-wild Aff-Wild2 dataset. Aff-Wild2 contains 548 videos totaling 2,813,201 frames, with annotations for three core tasks: valence-arousal estimation, action unit (AU) recognition, and eight-category facial expression classification. Arousal denotes how active a person is, whereas valence reveals how positive he or she is; action units are the basic actions of individual muscles or muscle groups used to describe an emotion. The eight facial expressions are Neutral, Anger, Disgust, Fear, Happiness, Sadness, Surprise, and Other. The proposed three-stream network targets the multi-modal nature of Aff-Wild2: because the dataset provides cropped face images and audio extracted from the video frames, facial expressions must be predicted from these multi-modal inputs.
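For the audio side of these multi-modal inputs, the text above says the signal is converted into a mel-spectrogram image. Below is a minimal sketch of that conversion using librosa; the sample rate, FFT size, hop length, and mel-band count are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
import librosa

def audio_to_mel_image(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Convert an audio clip into a mel-spectrogram 'image' for the audio stream.

    All parameter values here are assumptions chosen for illustration.
    """
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # log-scale the power values
    # Min-max normalize to [0, 255] so the result can be treated like an image.
    img = 255 * (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    return img.astype(np.uint8)
```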
The Aff-Wild2 expression dataset contains 548 videos and about 2.8 million frames. It covers eight classes: the seven emotions Neutral, Anger, Disgust, Fear, Happiness, Sadness, and Surprise, plus an Other category. The training set contains 253 videos and the validation set 70 videos.
To classify the eight facial expressions, the researchers fed multi-modal data from the Aff-Wild2 dataset, a cropped face, a stack of cropped faces, and the audio, into the three-stream model. Built on the recently developed Swin transformer, the model outperformed the baseline, and a further performance gain was achieved with the proposed half-mix jittering augmentation.
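The half-mix jittering augmentation is only named here, and its exact recipe is not described. As one plausible reading, an assumption rather than the authors' definition, the sketch below splices half of each training face with the corresponding half of another face in the batch and mixes the labels in equal proportion, in the spirit of Mixup/CutMix-style augmentations.

```python
import torch

def half_mix(images: torch.Tensor, labels: torch.Tensor, num_classes: int = 8):
    """Illustrative 'half-mix' jittering (assumed form, not the paper's exact recipe).

    images: (B, C, H, W) batch of cropped face images.
    labels: (B,) integer expression labels.
    Returns spliced images and soft labels mixed 50/50 between the two sources.
    """
    b, _, _, w = images.shape
    perm = torch.randperm(b)                          # random partner for each image
    mixed = images.clone()
    mixed[..., w // 2:] = images[perm][..., w // 2:]  # right half from the partner

    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    soft = 0.5 * one_hot + 0.5 * one_hot[perm]        # each half contributes equally
    return mixed, soft
```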