SOTAVerified

Benchmarking

Papers

Showing 891–900 of 5548 papers

Title	Date	Tasks	Status	Hype
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators	Feb 20, 2025	Benchmarking, Code Generation	Code Available	2
Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk	Feb 20, 2025	Benchmarking	Unverified	0
Probabilistic Robustness in Deep Learning: A Concise yet Comprehensive Guide	Feb 20, 2025	Adversarial Robustness, Benchmarking	Unverified	0
Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework	Feb 20, 2025	Benchmarking, Question Answering	Code Available	0
Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models	Feb 20, 2025	Benchmarking, Sentence	Unverified	0
Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks	Feb 20, 2025	Benchmarking, Combinatorial Optimization	Unverified	0
Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models	Feb 20, 2025	Benchmarking	Unverified	0
Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems	Feb 20, 2025	Benchmarking, Decision Making	Unverified	0
FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis	Feb 20, 2025	Age Estimation, Benchmarking	Code Available	2
PredictaBoard: Benchmarking LLM Score Predictability	Feb 20, 2025	Benchmarking, Common Sense Reasoning	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified