Socratis: Are large multimodal models emotionally aware?

1Boston University, 2Meta AI (FAIR), 3MIT

ICCV Workshops 2023 - Workshop On Emotionally And Culturally Intelligent AI

Existing emotion prediction benchmarks lack the diversity of emotions that an image-caption pair can elicit.

We release Socratis, a benchmark containing 18K diverse emotions and the reasons for feeling them on 2K image-caption pairs. Our preliminary findings show that humans prefer human-written emotional reactions over machine-generated ones more than twice as often.

Abstract

Existing emotion prediction benchmarks contain coarse emotion labels which do not consider the diversity of emotions that an image and text can elicit in humans for various reasons. Learning diverse reactions to multimodal content is important as intelligent machines take a central role in generating and delivering content to society. To address this gap, we propose Socratis, a societal reactions benchmark, where each image-caption (IC) pair is annotated with multiple emotions and the reasons for feeling them. Socratis contains 18K free-form reactions for 980 emotions on 2075 image-caption pairs from 5 widely-read news and image-caption datasets. We benchmark the capability of state-of-the-art multimodal large language models to generate the reasons for feeling an emotion given an IC pair. Based on a preliminary human study, we observe that humans prefer human-written reasons over 2 times more often than machine-generated ones. This suggests our task is harder than standard generation tasks: it stands in stark contrast to recent findings that humans cannot, for instance, tell apart machine-written from human-written news articles. We further find that current captioning metrics based on large vision-language models also fail to correlate with human preferences. We hope that these findings and our benchmark will inspire further research on training emotionally aware models.

Approach

Results

Qualitative examples from our Socratis dataset alongside generations from a state-of-the-art multimodal model, BLIP-2.
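As a rough illustration of how such generations can be produced, the sketch below prompts BLIP-2 through Hugging Face Transformers to write a free-form reaction for an image-caption pair. The checkpoint name, prompt template, and file path are assumptions for illustration and are not necessarily the exact setup used in our experiments.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

checkpoint = "Salesforce/blip2-flan-t5-xl"  # assumed checkpoint
processor = Blip2Processor.from_pretrained(checkpoint)
model = Blip2ForConditionalGeneration.from_pretrained(checkpoint)

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
caption = "Protesters gather outside city hall."
emotion = "hopeful"

# Hypothetical prompt template asking the model to explain the emotion;
# the prompt used in our experiments may differ.
prompt = (
    f"Caption: {caption} "
    f"Question: Why might someone feel {emotion} after seeing this image and caption? "
    "Answer:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=60)

# Decode the generated reaction text.
reaction = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(reaction)
```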


Human Study: humans prefer human-written explanations over machine-written

We ran a human study to evaluate preference for machine-written vs. human-written reactions. Annotators could choose between Human, Machine, Both-Good, or Both-Bad. Our preliminary results show that humans largely prefer human-written reactions over BLIP-2 (machine) generated reactions: human explanations were chosen 1.5 times more often than machine explanations.
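A minimal sketch of how these choices could be tallied into a preference ratio is shown below; the record format and the "choice" labels are assumptions for illustration, not our actual annotation schema.

```python
from collections import Counter

# Toy annotation records: one preference choice per image-caption pair.
annotations = [
    {"pair_id": 0, "choice": "human"},
    {"pair_id": 1, "choice": "machine"},
    {"pair_id": 2, "choice": "both_good"},
    {"pair_id": 3, "choice": "human"},
    {"pair_id": 4, "choice": "both_bad"},
]

counts = Counter(record["choice"] for record in annotations)
print(counts)

# Ratio of human-preferred to machine-preferred reactions.
if counts["machine"]:
    print("human/machine preference ratio:", counts["human"] / counts["machine"])
```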

Metric Study: Current metrics fail to correlate with human preference

On the left, we ran evaluations with BART and CLIP scores on subsets where humans preferred human-written reactions, BLIP-2 (machine) reactions, or both. Ideally, generations rated machine-better or both-good should be scored higher than those rated human-better. From the results, we find that these commonly used metrics do not score the subsets that humans rated as good (machine-better or both-good) differently from those rated as bad (human-better), suggesting that they do not correlate with human preference. Hence, research on new metrics is needed.
On the right, we compare the metrics of a multimodal vs. a text-only model on the relevance of generations to the image. The results show that multimodal models are, understandably, more image-relevant than language-only models. However, visual relevance does not correlate with human preference.
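For reference, the sketch below computes a CLIP-based image-text relevance score for a generated reaction, in the spirit of CLIPScore (Hessel et al., 2021, where the 2.5 scaling factor comes from). The checkpoint, image path, and reaction text are placeholders; this is not the exact evaluation pipeline used in our study.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"  # assumed CLIP checkpoint
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
reaction = "The crowd looks determined, which makes me feel hopeful."

inputs = processor(
    text=[reaction], images=image, return_tensors="pt",
    padding=True, truncation=True,
)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Cosine similarity between image and text embeddings, rescaled as in CLIPScore.
cosine = torch.nn.functional.cosine_similarity(image_emb, text_emb).item()
clip_score = 2.5 * max(cosine, 0.0)
print(f"CLIP-based relevance: {clip_score:.3f}")
```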

Future Work

In the future, we plan to benchmark more state-of-the-art language and multimodal language models, such as LLaVA and MiniGPT-4, on Socratis. We also plan to explore adaptation methods such as in-context learning in multimodal LLMs with example images, captions, and explanations, as sketched below.
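A hypothetical sketch of the kind of in-context prompt we plan to explore is shown below: a few (caption, emotion, explanation) examples interleaved before the query. The field names and formatting are illustrative, not a finalized protocol, and the corresponding images would be passed alongside each caption through the multimodal model's own API.

```python
# Toy few-shot examples; in practice these would come from Socratis annotations.
few_shot_examples = [
    {
        "caption": "A firefighter carries a child out of a burning building.",
        "emotion": "relieved",
        "explanation": "The child is being rescued, so the danger is passing.",
    },
    {
        "caption": "Floodwater covers an entire neighborhood.",
        "emotion": "anxious",
        "explanation": "Homes are destroyed and residents' safety is uncertain.",
    },
]

def build_prompt(query_caption: str, query_emotion: str) -> str:
    """Assemble a few-shot text prompt for the query image-caption pair."""
    parts = []
    for ex in few_shot_examples:
        parts.append(
            f"Caption: {ex['caption']}\n"
            f"Emotion: {ex['emotion']}\n"
            f"Why: {ex['explanation']}\n"
        )
    parts.append(f"Caption: {query_caption}\nEmotion: {query_emotion}\nWhy:")
    return "\n".join(parts)

print(build_prompt("Protesters gather outside city hall.", "hopeful"))
```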

Related Links

Several related works examine empathy and emotion in multimodal models.

NICE: Neural Image Commenting with Empathy is a machine learning model designed to generate image captions that exhibit a higher level of emotional understanding and empathy. Unlike conventional image captioning models, NICE aims to provide comments that not only describe the visual content but also consider the emotional context and human-like response. It achieves this by incorporating empathy-aware components into the caption generation process, making it a promising development in improving AI-generated image descriptions.

Can machines learn morality? The Delphi Experiment uses deep neural networks to reason about descriptive ethical judgments, such as determining whether an action is generally good or bad. While Delphi shows promise in its ability to generalize ethical reasoning, it also highlights the need for explicit moral instruction in AI, as it can exhibit biases and imperfections.

ArtEmis: Affective Language for Visual Art is a dataset that comprises 439,000 emotion attributions and explanations for 81,000 artworks from WikiArt. Unlike many existing computer vision datasets, ArtEmis centers on the emotional responses evoked by visual art, with annotators indicating their dominant emotions and providing verbal explanations. This dataset serves as the foundation for training captioning systems that excel in expressing and explaining emotional responses to visual stimuli, often surpassing other datasets in conveying the semantic and abstract content of the artworks.

BibTeX


@misc{deng2023socratis,
      title={Socratis: Are large multimodal models emotionally aware?},
      author={Katherine Deng and Arijit Ray and Reuben Tan and Saadia Gabriel and Bryan A. Plummer and Kate Saenko},
      year={2023},
      eprint={2308.16741},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}