SOTAVerified

Benchmarking

Papers

Showing 201-250 of 5548 papers

Title | Status | Hype
GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification | Code | 2
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond | Code | 2
FaceScore: Benchmarking and Enhancing Face Quality in Human Generation | Code | 2
GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization | Code | 2
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance | Code | 2
Fortuna: A Library for Uncertainty Quantification in Deep Learning | Code | 2
GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis | Code | 2
HourVideo: 1-Hour Video-Language Understanding | Code | 2
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Code | 2
HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation | Code | 2
Immersive Neural Graphics Primitives | Code | 2
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks | Code | 2
Benchmarking Benchmark Leakage in Large Language Models | Code | 2
InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models | Code | 2
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks | Code | 2
Exponentially Faster Language Modelling | Code | 2
InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback | Code | 2
IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning | Code | 2
An OpenMind for 3D medical vision self-supervised learning | Code | 2
Event-Based Motion Magnification | Code | 2
Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details | Code | 2
Extended Agriculture-Vision: An Extension of a Large Aerial Image Dataset for Agricultural Pattern Analysis | Code | 2
BARS: Towards Open Benchmarking for Recommender Systems | Code | 2
EV2Gym: A Flexible V2G Simulator for EV Smart Charging Research and Benchmarking | Code | 2
Learning to Fly -- a Gym Environment with PyBullet Physics for Reinforcement Learning of Multi-agent Quadcopter Control | Code | 2
Learning Transferable Visual Models From Natural Language Supervision | Code | 2
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Code | 2
EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models | Code | 2
EvalGIM: A Library for Evaluating Generative Image Models | Code | 2
Advances in APPFL: A Comprehensive and Extensible Federated Learning Framework | Code | 2
FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | Code | 2
LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters | Code | 2
A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark | Code | 2
LtU-ILI: An All-in-One Framework for Implicit Inference in Astrophysics and Cosmology | Code | 2
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | Code | 2
AutoPenBench: Benchmarking Generative Agents for Penetration Testing | Code | 2
EffiBench: Benchmarking the Efficiency of Automatically Generated Code | Code | 2
EasyTPP: Towards Open Benchmarking Temporal Point Processes | Code | 2
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Code | 2
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs | Code | 2
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation | Code | 2
State-specific protein-ligand complex structure prediction with a multi-scale deep generative model | Code | 2
Fast Vision Transformers with HiLo Attention | Code | 2
Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint) | Code | 2
A Content-Driven Micro-Video Recommendation Dataset at Scale | Code | 2
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations | Code | 2
Deep Visual Geo-localization Benchmark | Code | 2
A large annotated medical image dataset for the development and evaluation of segmentation algorithms | Code | 2
Datasets and Benchmarks for Offline Safe Reinforcement Learning | Code | 2
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Code | 2
Page 5 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified