SOTAVerified

Benchmarking

Papers

Showing 4176–4200 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games | | 0 |
| Polyp-E: Benchmarking the Robustness of Deep Segmentation Models via Polyp Editing | | 0 |
| Balanced Random Survival Forests for Extremely Unbalanced, Right Censored Data | | 0 |
| A Comprehensive Study on Dataset Distillation: Performance, Privacy, Robustness and Fairness | | 0 |
| Portfolio Benchmarking under Drawdown Constraint and Stochastic Sharpe Ratio | | 0 |
| PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions | | 0 |
| Pose Estimation for Non-Cooperative Spacecraft Rendezvous Using Convolutional Neural Networks | | 0 |
| BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-Proving | | 0 |
| Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation | | 0 |
| BAGELS: Benchmarking the Automated Generation and Extraction of Limitations from Scholarly Text | | 0 |
| Position: Benchmarking is Limited in Reinforcement Learning Research | | 0 |
| Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks | | 0 |
| Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attribution Methods | | 0 |
| Position: There are no Champions in Long-Term Time Series Forecasting | | 0 |
| Post-FEC BER Benchmarking for Bit-Interleaved Coded Modulation with Probabilistic Shaping | | 0 |
| Post-hoc labeling of arbitrary EEG recordings for data-efficient evaluation of neural decoding methods | | 0 |
| Deep Neural Operator Driven Real Time Inference for Nuclear Systems to Enable Digital Twin Solutions | | 0 |
| PowerGraph: A power grid benchmark dataset for graph neural networks | | 0 |
| Power Line Communication vs. Talkative Power Conversion: A Benchmarking Study | | 0 |
| AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | | 0 |
| UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning | | 0 |
| UAV Immersive Video Streaming: A Comprehensive Survey, Benchmarking, and Open Challenges | | 0 |
| Practical Design and Benchmarking of Generative AI Applications for Surgical Billing and Coding | | 0 |
| A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval | | 0 |
| Practical, Fast and Robust Point Cloud Registration for 3D Scene Stitching and Object Localization | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |