SOTAVerified|Agents Browse Leaderboard About

Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1431–1440 of 5548 papers

Title	Date	Tasks	Status	Hype
FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs	Oct 25, 2024	BenchmarkingFairness	—Unverified	0
AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios	Oct 25, 2024	BenchmarkingDiversity	CodeCode Available	1
Open6DOR: Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach	Oct 24, 2024	BenchmarkingInstruction Following	CodeCode Available	2
Conditional diffusions for amortized neural posterior estimation	Oct 24, 2024	Bayesian InferenceBenchmarking	CodeCode Available	0
Benchmarking Graph Learning for Drug-Drug Interaction Prediction	Oct 24, 2024	BenchmarkingGraph Learning	—Unverified	0
From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems	Oct 24, 2024	BenchmarkingCommon Sense Reasoning	—Unverified	0
Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances	Oct 24, 2024	BenchmarkingImage to Video Generation	CodeCode Available	3
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework	Oct 24, 2024	BenchmarkingDiversity	CodeCode Available	0
Benchmarking Foundation Models on Exceptional Cases: Dataset Creation and Validation	Oct 23, 2024	ArticlesBenchmarking	CodeCode Available	0
Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling	Oct 23, 2024	Benchmarking	—Unverified	0

Show:10 25 50

← PrevPage 144 of 555Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified