SOTAVerified

Benchmarking

Papers

Showing 301–350 of 5548 papers

Title | Status | Hype
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Code | 2
EV2Gym: A Flexible V2G Simulator for EV Smart Charging Research and Benchmarking | Code | 2
GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models | Code | 2
An OpenMind for 3D medical vision self-supervised learning | Code | 2
GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification | Code | 2
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond | Code | 2
GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization | Code | 2
GV-Bench: Benchmarking Local Feature Matching for Geometric Verification of Long-term Loop Closure Detection | Code | 2
EffiBench: Benchmarking the Efficiency of Automatically Generated Code | Code | 2
EvalGIM: A Library for Evaluating Generative Image Models | Code | 2
Fast Vision Transformers with HiLo Attention | Code | 2
GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis | Code | 2
HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance | Code | 2
Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions | Code | 2
IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer | Code | 2
Immersive Neural Graphics Primitives | Code | 2
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Code | 2
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Code | 2
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation | Code | 2
IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning | Code | 2
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering | Code | 2
AiTLAS: Artificial Intelligence Toolbox for Earth Observation | Code | 2
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Code | 2
Deep Visual Geo-localization Benchmark | Code | 2
Datasets and Benchmarks for Offline Safe Reinforcement Learning | Code | 2
Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale | Code | 2
DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Code | 2
A large annotated medical image dataset for the development and evaluation of segmentation algorithms | Code | 2
Large-Scale Multi-Center CT and MRI Segmentation of Pancreas with Deep Learning | Code | 2
LawBench: Benchmarking Legal Knowledge of Large Language Models | Code | 2
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Code | 2
Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint) | Code | 2
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation | Code | 2
LLaMEA-BO: A Large Language Model Evolutionary Algorithm for Automatically Generating Bayesian Optimization Algorithms | Code | 2
State-specific protein-ligand complex structure prediction with a multi-scale deep generative model | Code | 2
Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Code | 2
LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters | Code | 2
LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents | Code | 2
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | Code | 2
Benchmarking and Improving Detail Image Caption | Code | 2
MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data | Code | 2
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2
Benchmarking Benchmark Leakage in Large Language Models | Code | 2
Benchmarking Complex Instruction-Following with Multiple Constraints Composition | Code | 2
Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Code | 2
MINERVA: Evaluating Complex Video Reasoning | Code | 2
EasyTPP: Towards Open Benchmarking Temporal Point Processes | Code | 2
COALA: A Practical and Vision-Centric Federated Learning Platform | Code | 2
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Code | 2
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Code | 2
Page 7 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified