Evaluation of cross-ethnic emotion recognition capabilities in multimodal large language models using the Reading the Mind in the Eyes Test
Scientific Reports. England. 2026
Year of publication: 2026
ISSN: 2045-2322
Publication type: Journal article
Language: English
DOI: 10.1038/s41598-026-39292-y
Abstract
Accurate emotion recognition is a foundational component of social cognition, yet human biases can compromise its reliability. The emergent capabilities of multimodal large language models (MLLMs) offer a potential avenue for objective analysis, but their performance has been tested mainly with ethnically homogeneous stimuli. This study provides a systematic cross-ethnic evaluation of leading MLLMs on an emotion recognition task to assess their accuracy and consistency across diverse groups. We evaluated three leading MLLMs: ChatGPT-4, ChatGPT-4o, and Claude 3 Opus. Performance was tested twice using three "Reading the Mind in the Eyes Test" (RMET) versions featuring White, Black, and Korean faces. We analyzed accuracy against chance (25%) and compared scores to established human normative data for each ethnic version. ChatGPT-4o achieved performance significantly above chance levels across all tests (p < .001), with large effect sizes indicating robust performance (Cohen's h = 1.253-1.619; RD = 0.583-0.694). The model obtained a mean accuracy of 83.3% (30/36) on the White RMET, 94.4% (34/36) on the Black RMET, and 86.1% (31/36) on the Korean RMET, placing it in the 85th, 94th, and 90th percentiles of human norms, respectively. This high accuracy remained consistent across ethnic stimuli. In contrast, ChatGPT-4 performed near the human average, while Claude 3 Opus performed near chance level.
These preliminary findings highlight the rapid evolution of MLLMs, with a significant performance leap between consecutive versions. The results suggest that ChatGPT-4o exceeded average human accuracy on this specific task, recognizing complex emotions from static images of the eye region, with performance remaining consistent across different ethnic groups. While these results are notable, the pronounced performance gaps between models and the inherent limitations of the RMET underscore the need for continued validation and careful ethical consideration to fully understand the capabilities and boundaries of this technology.
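The statistics reported in the abstract can be reproduced from the raw scores alone. A minimal sketch, using only the Python standard library: Cohen's h is the difference between the arcsine-transformed proportions, RD is the simple risk difference against chance (25%, since each RMET item has four answer options), and the p-value is an exact one-sided binomial test. The per-version scores (30, 34, and 31 correct out of 36) are taken from the abstract; the function and variable names are illustrative, not from the paper.

```python
import math

def cohens_h(p1, p2):
    # Cohen's h: effect size for two proportions via the
    # arcsine transform phi = 2 * arcsin(sqrt(p)).
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def binom_sf(k, n, p):
    # Exact one-sided binomial tail probability P(X >= k)
    # for n trials with per-trial success probability p.
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

CHANCE = 0.25  # four answer options per RMET item
N_ITEMS = 36
scores = {"White": 30, "Black": 34, "Korean": 31}  # ChatGPT-4o, per abstract

for version, correct in scores.items():
    acc = correct / N_ITEMS
    h = cohens_h(acc, CHANCE)                # effect size vs. chance
    rd = acc - CHANCE                        # risk difference vs. chance
    p_val = binom_sf(correct, N_ITEMS, CHANCE)
    print(f"{version}: acc={acc:.3f}, h={h:.3f}, RD={rd:.3f}, p={p_val:.2e}")
```

Running this recovers the abstract's reported ranges: h spans roughly 1.253 (White, 30/36) to 1.619 (Black, 34/36), RD spans 0.583 to 0.694, and every tail probability is far below .001.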