SOTAVerified

Benchmarking

Papers

Showing 701–725 of 5548 papers

Title | Status | Hype
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Code | 1
The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs | Code | 1
Writing as a testbed for open ended agents |  | 0
Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery | Code | 1
Mining-Gym: A Configurable RL Benchmarking Environment for Truck Dispatch Scheduling | Code | 0
LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages | Code | 0
Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness | Code | 1
Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages |  | 0
Benchmarking Post-Hoc Unknown-Category Detection in Food Recognition |  | 0
EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation |  | 0
Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis |  | 0
GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks | Code | 1
SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining | Code | 3
A Study on Neuro-Symbolic Artificial Intelligence: Healthcare Perspectives |  | 0
Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering |  | 0
Regularization of ML models for Earth systems by using longer model timesteps |  | 0
Accurate Peak Detection in Multimodal Optimization via Approximated Landscape Learning | Code | 0
IceBench: A Benchmark for Deep Learning based Sea Ice Type Classification | Code | 0
CardioTabNet: A Novel Hybrid Transformer Model for Heart Disease Prediction using Tabular Medical Data |  | 0
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding | Code | 0
V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction | Code | 1
Benchmark Dataset for Pore-Scale CO2-Water Interaction |  | 0
CausalRivers -- Scaling up benchmarking of causal discovery for real-world time-series |  | 0
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Code | 2
ContextGNN goes to Elliot: Towards Benchmarking Relational Deep Learning for Static Link Prediction (aka Personalized Item Recommendation) | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified