
EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models

2023-12-11 · Code Available

Samuel J. Paech


Abstract

We introduce EQ-Bench, a novel benchmark designed to evaluate aspects of emotional intelligence in Large Language Models (LLMs). We assess the ability of LLMs to understand complex emotions and social interactions by asking them to predict the intensity of emotional states of characters in a dialogue. The benchmark is able to discriminate effectively between a wide range of models. We find that EQ-Bench correlates strongly with comprehensive multi-domain benchmarks like MMLU (Hendrycks et al., 2020) (r=0.97), indicating that we may be capturing similar aspects of broad intelligence. Our benchmark produces highly repeatable results using a set of 60 English-language questions. We also provide open-source code for an automated benchmarking pipeline at https://github.com/EQ-bench/EQ-Bench and a leaderboard at https://eqbench.com
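The abstract describes the task (rating the intensity of characters' emotional states in a dialogue) but not the exact scoring rule, so the following is a minimal sketch of how one question might be scored, assuming the model returns 0-10 intensity ratings for a small set of emotions and that the score rewards closeness to reference ratings. The function name, emotion labels, and the distance-based normalisation are illustrative assumptions, not the paper's exact method; see the linked repository for the real pipeline.

```python
# Minimal sketch of a distance-based scoring rule for one EQ-Bench question.
# ASSUMPTION: each question asks for 0-10 intensity ratings of several emotions,
# and the question score rewards closeness to reference ratings. The exact
# formula used by EQ-Bench may differ; see the repository for the real pipeline.

def score_question(predicted: dict[str, float], reference: dict[str, float]) -> float:
    """Return a 0-10 score for one dialogue question.

    predicted / reference map emotion names to 0-10 intensity ratings,
    e.g. {"anger": 7, "surprise": 2, "guilt": 5, "relief": 0}.
    """
    total_error = sum(
        abs(predicted.get(emotion, 0.0) - reference[emotion])
        for emotion in reference
    )
    # Normalise: zero error -> 10, maximum possible error -> 0.
    max_error = 10.0 * len(reference)
    return 10.0 * (1.0 - total_error / max_error)


# Example with hypothetical predicted and reference ratings.
predicted = {"anger": 6, "surprise": 1, "guilt": 4, "relief": 0}
reference = {"anger": 7, "surprise": 2, "guilt": 5, "relief": 0}
print(round(score_question(predicted, reference), 2))  # 9.25
```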

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| EQ-Bench | OpenAI gpt-4-0613 | EQ-Bench Score | 62.52 | | Unverified |
| EQ-Bench | migtissera/SynthIA-70B-v1.5 | EQ-Bench Score | 54.83 | | Unverified |
| EQ-Bench | OpenAI gpt-4-0314 | EQ-Bench Score | 53.39 | | Unverified |
| EQ-Bench | Qwen/Qwen-72B-Chat | EQ-Bench Score | 52.44 | | Unverified |
| EQ-Bench | Anthropic Claude2 | EQ-Bench Score | 52.14 | | Unverified |
| EQ-Bench | meta-llama/Llama-2-70b-chat-hf | EQ-Bench Score | 51.56 | | Unverified |
| EQ-Bench | 01-ai/Yi-34B-Chat | EQ-Bench Score | 51.03 | | Unverified |
| EQ-Bench | OpenAI gpt-3.5-0613 | EQ-Bench Score | 49.17 | | Unverified |
| EQ-Bench | OpenAI gpt-3.5-turbo-0301 | EQ-Bench Score | 47.61 | | Unverified |
| EQ-Bench | Open-Orca/Mistral-7B-OpenOrca | EQ-Bench Score | 44.4 | | Unverified |
| EQ-Bench | Qwen/Qwen-14B-Chat | EQ-Bench Score | 43.76 | | Unverified |
| EQ-Bench | OpenAI text-davinci-003 | EQ-Bench Score | 43.73 | | Unverified |
| EQ-Bench | Intel/neural-chat-7b-v3-1 | EQ-Bench Score | 43.61 | | Unverified |
| EQ-Bench | OpenAI text-davinci-002 | EQ-Bench Score | 39.44 | | Unverified |
| EQ-Bench | openchat/openchat 3.5 | EQ-Bench Score | 37.08 | | Unverified |
| EQ-Bench | lmsys/vicuna-33b-v1.3 | EQ-Bench Score | 36.52 | | Unverified |
| EQ-Bench | meta-llama/Llama-2-13b-chat-hf | EQ-Bench Score | 33.02 | | Unverified |
| EQ-Bench | lmsys/vicuna-13b-v1.1 | EQ-Bench Score | 32.85 | | Unverified |
| EQ-Bench | meta-llama/Llama-2-7b-chat-hf | EQ-Bench Score | 25.43 | | Unverified |
| EQ-Bench | Koala 13B | EQ-Bench Score | 24.92 | | Unverified |
| EQ-Bench | lmsys/vicuna-7b-v1.1 | EQ-Bench Score | 22.24 | | Unverified |
| EQ-Bench | OpenAI text-davinci-001 | EQ-Bench Score | 15.19 | | Unverified |
| EQ-Bench | OpenAI ADA | EQ-Bench Score | 2.25 | | Unverified |
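
The abstract reports a strong Pearson correlation (r = 0.97) between EQ-Bench and MMLU. The snippet below is a minimal sketch of how such a correlation would be computed over paired per-model scores. The eq_bench values are the first few claimed scores from the table above; the mmlu values are hypothetical placeholders for illustration only and are not figures from the paper.

```python
# Sketch of a Pearson correlation between per-model EQ-Bench and MMLU scores,
# as reported in the abstract (r = 0.97). The mmlu list below contains
# PLACEHOLDER values for illustration, not the paper's data.

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    std_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (std_x * std_y)

eq_bench = [62.52, 54.83, 53.39, 52.44, 52.14]  # claimed scores from the table above
mmlu = [85.0, 70.0, 72.0, 74.0, 76.0]           # hypothetical placeholder values
print(round(pearson_r(eq_bench, mmlu), 3))
```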
