SOTAVerified

Benchmarking

Papers

Showing 1451–1460 of 5548 papers

Title	Date	Tasks	Status	Hype
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style	Oct 21, 2024	Benchmarking, Language Modeling	Code Available	2
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following	Oct 21, 2024	Benchmarking, Instruction Following	Code Available	2
Hiding in Plain Sight: Reframing Hardware Trojan Benchmarking as a Hide&Seek Modification	Oct 21, 2024	Benchmarking	Unverified	0
Comprehensive benchmarking of large language models for RNA secondary structure prediction	Oct 21, 2024	Benchmarking	Code Available	1
A Framework for Evaluating Predictive Models Using Synthetic Image Covariates and Longitudinal Data	Oct 21, 2024	Benchmarking	Unverified	0
Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping	Oct 21, 2024	Benchmarking	Unverified	0
Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence	Oct 20, 2024	Benchmarking	Unverified	0
FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational Learning	Oct 19, 2024	Benchmarking, Drug Discovery	Code Available	0
IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning	Oct 19, 2024	Benchmarking, Multi-agent Reinforcement Learning	Code Available	2
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation	Oct 19, 2024	AI Agent, Benchmarking	Code Available	2

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified