SOTAVerified

Benchmarking

Papers

Showing 1-50 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Visual Place Recognition for Large-Scale UAV Applications | | 0 |
| MUPAX: Multidimensional Problem Agnostic eXplainable AI | | 0 |
| Training Transformers with Enforced Lipschitz Constants | | 0 |
| Disentangling coincident cell events using deep transfer learning and compressive sensing | | 0 |
| DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition | Code | 0 |
| DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering | Code | 2 |
| FLsim: A Modular and Library-Agnostic Simulation Framework for Federated Learning | Code | 0 |
| A Multi-View High-Resolution Foot-Ankle Complex Point Cloud Dataset During Gait for Occlusion-Robust 3D Completion | | 0 |
| DCR: Quantifying Data Contamination in LLMs Evaluation | Code | 0 |
| CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance | | 0 |
| CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks | | 0 |
| Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop | | 0 |
| MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking | | 0 |
| Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models | Code | 0 |
| Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift | | 0 |
| Identifying the Smallest Adversarial Load Perturbations that Render DC-OPF Infeasible | Code | 0 |
| Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning | Code | 0 |
| Benchmarking Waitlist Mortality Prediction in Heart Transplantation Through Time-to-Event Modeling using New Longitudinal UNOS Dataset | | 0 |
| A Systematic Analysis of Hybrid Linear Attention | | 0 |
| Hyperspectral Anomaly Detection Methods: A Survey and Comparative Study | | 0 |
| SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations | Code | 0 |
| SQLBarber: A System Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads | | 0 |
| Inaugural MOASEI Competition at AAMAS'2025: A Technical Report | | 0 |
| LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models | Code | 1 |
| GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning | Code | 2 |
| STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking | Code | 0 |
| LANTERN: A Machine Learning Framework for Lipid Nanoparticle Transfection Efficiency Prediction | Code | 0 |
| Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data | Code | 1 |
| CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks | | 0 |
| TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation | | 0 |
| State and Memory is All You Need for Robust and Reliable AI Agents | | 0 |
| Point Cloud Compression and Objective Quality Assessment: A Survey | | 0 |
| Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge | | 0 |
| mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale | Code | 0 |
| FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation | | 0 |
| Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation | Code | 0 |
| CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1 |
| scMamba: A Scalable Foundation Model for Single-Cell Multi-Omics Integration Beyond Highly Variable Feature Selection | | 0 |
| MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans | | 0 |
| FixCLR: Negative-Class Contrastive Learning for Semi-Supervised Domain Generalization | | 0 |
| AI-Driven MRI-based Brain Tumour Segmentation Benchmarking | | 0 |
| inMOTIFin: a lightweight end-to-end simulation software for regulatory sequences | Code | 0 |
| HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction | Code | 0 |
| Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision | | 0 |
| Benchmarking Unsupervised Strategies for Anomaly Detection in Multivariate Time Series | Code | 0 |
| A Survey of Predictive Maintenance Methods: An Analysis of Prognostics via Classification and Regression | | 0 |
| BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos | | 0 |
| WattsOnAI: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads | Code | 1 |
| Quantitative Benchmarking of Anomaly Detection Methods in Digital Pathology | | 0 |
| MDR-DeePC: Model-Inspired Distributionally Robust Data-Enabled Predictive Control | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |