SOTAVerified

Benchmarking

Papers

Showing 1471–1480 of 5548 papers

Title	Date	Tasks	Status	Hype
Trust but Verify: Programmatic VLM Evaluation in the Wild	Oct 17, 2024	Benchmarking, Language Modelling	Unverified	0
ORCHID: A Chinese Debate Corpus for Target-Independent Stance Detection and Argumentative Dialogue Summarization	Oct 17, 2024	Benchmarking, Stance Detection	Code Available	0
Configurable Embodied Data Generation for Class-Agnostic RGB-D Video Segmentation	Oct 16, 2024	Benchmarking, Panoptic Segmentation	Unverified	0
Understanding the Role of LLMs in Multimodal Evaluation Benchmarks	Oct 16, 2024	Benchmarking, Large Language Model	Code Available	0
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation	Oct 16, 2024	Benchmarking, Fairness	Code Available	1
AERO: Softmax-Only LLMs for Efficient Private Inference	Oct 16, 2024	Benchmarking, Decoder	Unverified	0
Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions	Oct 16, 2024	Benchmarking	Unverified	0
Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs	Oct 16, 2024	Benchmarking	Unverified	0
MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from Microwatts to Megawatts for Sustainable AI	Oct 15, 2024	Benchmarking	Code Available	4
Benchmarking Data Efficiency in Δ-ML and Multifidelity Models for Quantum Chemistry	Oct 15, 2024	Benchmarking	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified