SOTAVerified

Benchmarking

Papers

Showing 4151–4200 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs | | 0 |
| Pitfalls of topology-aware image segmentation | | 0 |
| pix2pockets: Shot Suggestions in 8-Ball Pool from a Single Image in the Wild | | 0 |
| A Computer Vision System to Localize and Classify Wastes on the Streets | | 0 |
| Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities | | 0 |
| A Comprehensive Survey on Video Scene Parsing:Advances, Challenges, and Prospects | | 0 |
| PKLot-A robust dataset for parking lot classification | | 0 |
| PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI | | 0 |
| BEADs: Bias Evaluation Across Domains | | 0 |
| BEACON: A Benchmark for Efficient and Accurate Counting of Subgraphs | | 0 |
| Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment | | 0 |
| BBOB Instance Analysis: Landscape Properties and Algorithm Performance across Problem Instances | | 0 |
| Bayesian Neural Networks at Scale: A Performance Analysis and Pruning Study | | 0 |
| Bayesian Multi-type Mean Field Multi-agent Imitation Learning | | 0 |
| White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs | | 0 |
| A Bayesian Model for Bivariate Causal Inference | | 0 |
| A Comprehensive Study on the Robustness of Image Classification and Object Detection in Remote Sensing: Surveying and Benchmarking | | 0 |
| A Comprehensive Study on Robustness of Image Classification Models: Benchmarking and Rethinking | | 0 |
| Barkour: Benchmarking Animal-level Agility with Quadruped Robots | | 0 |
| BanglaNLP at BLP-2023 Task 1: Benchmarking different Transformer Models for Violence Inciting Text Detection in Bengali | | 0 |
| Point Cloud Compression and Objective Quality Assessment: A Survey | | 0 |
| Point Cloud Objective Quality: Benchmarking Features and Quality Evaluation | | 0 |
| Polarization and Index Modulations: a Theoretical and Practical Perspective | | 0 |
| Policy Entropy for Out-of-Distribution Classification | | 0 |
| U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding | | 0 |
| BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games | | 0 |
| Polyp-E: Benchmarking the Robustness of Deep Segmentation Models via Polyp Editing | | 0 |
| Balanced Random Survival Forests for Extremely Unbalanced, Right Censored Data | | 0 |
| A Comprehensive Study on Dataset Distillation: Performance, Privacy, Robustness and Fairness | | 0 |
| Portfolio Benchmarking under Drawdown Constraint and Stochastic Sharpe Ratio | | 0 |
| PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions | | 0 |
| Pose Estimation for Non-Cooperative Spacecraft Rendezvous Using Convolutional Neural Networks | | 0 |
| BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-Proving | | 0 |
| Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation | | 0 |
| BAGELS: Benchmarking the Automated Generation and Extraction of Limitations from Scholarly Text | | 0 |
| Position: Benchmarking is Limited in Reinforcement Learning Research | | 0 |
| Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks | | 0 |
| Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attribution Methods | | 0 |
| Position: There are no Champions in Long-Term Time Series Forecasting | | 0 |
| Post-FEC BER Benchmarking for Bit-Interleaved Coded Modulation with Probabilistic Shaping | | 0 |
| Post-hoc labeling of arbitrary EEG recordings for data-efficient evaluation of neural decoding methods | | 0 |
| Deep Neural Operator Driven Real Time Inference for Nuclear Systems to Enable Digital Twin Solutions | | 0 |
| PowerGraph: A power grid benchmark dataset for graph neural networks | | 0 |
| Power Line Communication vs. Talkative Power Conversion: A Benchmarking Study | | 0 |
| AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | | 0 |
| UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning | | 0 |
| UAV Immersive Video Streaming: A Comprehensive Survey, Benchmarking, and Open Challenges | | 0 |
| Practical Design and Benchmarking of Generative AI Applications for Surgical Billing and Coding | | 0 |
| A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval | | 0 |
| Practical, Fast and Robust Point Cloud Registration for 3D Scene Stitching and Object Localization | | 0 |
Page 84 of 111

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |