SOTAVerified

Benchmarking

Papers


Showing 531–540 of 5548 papers

Title	Date	Tasks	Status	Hype
An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks	Feb 7, 2025	Benchmarking, Multi-agent Reinforcement Learning	Code Available	1
Large Language Models for Multi-Robot Systems: A Survey	Feb 6, 2025	Action Generation, Benchmarking	Code Available	1
PICBench: Benchmarking LLMs for Photonic Integrated Circuits Design	Feb 5, 2025	Benchmarking, Prompt Engineering	Code Available	1
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models	Feb 2, 2025	Benchmarking	Code Available	1
HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns	Jan 28, 2025	Adversarial Attack, Benchmarking	Code Available	1
Enhancing Biomedical Relation Extraction with Directionality	Jan 23, 2025	Benchmarking, Document-level Relation Extraction	Code Available	1
InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models	Jan 19, 2025	Benchmarking, Question Answering	Code Available	1
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot	Jan 15, 2025	Benchmarking, Hallucination	Code Available	1
ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind	Jan 15, 2025	Benchmarking, Multiple-choice	Code Available	1
TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations	Jan 13, 2025	Benchmarking, Domain Adaptation	Code Available	1


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified