SOTAVerified

Benchmarking

Papers

Showing 1421–1430 of 5548 papers

Title	Date	Tasks	Status	Hype
Autonomous Microscopy Experiments through Large Language Model Agents	Dec 18, 2024	Benchmarking, Experimental Design	Code Available	1
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning	Feb 22, 2024	Benchmarking	Code Available	1
Autonomous Reinforcement Learning: Formalism and Benchmarking	Dec 17, 2021	Benchmarking, reinforcement-learning	Code Available	1
CARLA: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms	Aug 2, 2021	Benchmarking, counterfactual	Code Available	1
COVID-19 event extraction from Twitter via extractive question answering with continuous prompts	Mar 19, 2023	Benchmarking, Event Extraction	Code Available	1
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset	Nov 5, 2024	Benchmarking, Language Modeling	Code Available	1
Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments	May 8, 2025	Benchmarking, Prompt Engineering	Code Available	1
Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks	Nov 4, 2024	Action Generation, Benchmarking	Code Available	1
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation	Oct 11, 2024	Benchmarking, Image Segmentation	Code Available	1
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling	Jun 10, 2025	Benchmarking	Code Available	1

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified