SOTAVerified

Benchmarking

Papers

Showing 251–300 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Benchmarking Predictive Coding Networks -- Made Simple | Code | 2 |
| Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark | Code | 2 |
| GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models | Code | 2 |
| Benchmarking Potential Based Rewards for Learning Humanoid Locomotion | Code | 2 |
| GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis | Code | 2 |
| GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks | Code | 2 |
| GSCodec Studio: A Modular Framework for Gaussian Splat Compression | Code | 2 |
| HLSFactory: A Framework Empowering High-Level Synthesis Datasets for Machine Learning and Beyond | Code | 2 |
| GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning | Code | 2 |
| Benchmarking Benchmark Leakage in Large Language Models | Code | 2 |
| PyGraft: Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips | Code | 2 |
| From Perfect to Noisy World Simulation: Customizable Embodied Multi-modal Perturbations for SLAM Robustness Benchmarking | Code | 2 |
| FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models | Code | 2 |
| FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation | Code | 2 |
| Fortuna: A Library for Uncertainty Quantification in Deep Learning | Code | 2 |
| MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly | Code | 2 |
| Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance | Code | 2 |
| Foundational Models Defining a New Era in Vision: A Survey and Outlook | Code | 2 |
| HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting? | Code | 2 |
| LawBench: Benchmarking Legal Knowledge of Large Language Models | Code | 2 |
| Extended Agriculture-Vision: An Extension of a Large Aerial Image Dataset for Agricultural Pattern Analysis | Code | 2 |
| Exponentially Faster Language Modelling | Code | 2 |
| FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | Code | 2 |
| Event-Based Motion Magnification | Code | 2 |
| Fast Vision Transformers with HiLo Attention | Code | 2 |
| EV2Gym: A Flexible V2G Simulator for EV Smart Charging Research and Benchmarking | Code | 2 |
| EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models | Code | 2 |
| EvalGIM: A Library for Evaluating Generative Image Models | Code | 2 |
| Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Code | 2 |
| Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details | Code | 2 |
| FedGraph: A Research Library and Benchmark for Federated Graph Learning | Code | 2 |
| State-specific protein-ligand complex structure prediction with a multi-scale deep generative model | Code | 2 |
| LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Code | 2 |
| EasyTPP: Towards Open Benchmarking Temporal Point Processes | Code | 2 |
| DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering | Code | 2 |
| DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation | Code | 2 |
| Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint) | Code | 2 |
| Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Code | 2 |
| Datasets and Benchmarks for Offline Safe Reinforcement Learning | Code | 2 |
| Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Code | 2 |
| CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | Code | 2 |
| DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Code | 2 |
| CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2 |
| Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Code | 2 |
| Benchmarking Agentic Workflow Generation | Code | 2 |
| CoqPilot, a plugin for LLM-based generation of proofs | Code | 2 |
| Deep Visual Geo-localization Benchmark | Code | 2 |
| EffiBench: Benchmarking the Efficiency of Automatically Generated Code | Code | 2 |
| FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Code | 2 |
| BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models | Code | 2 |
Page 6 of 111

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |