SOTAVerified

Benchmarking

Papers

Showing 301–325 of 5548 papers

Title | Status | Hype
Exponentially Faster Language Modelling | Code | 2
Extended Agriculture-Vision: An Extension of a Large Aerial Image Dataset for Agricultural Pattern Analysis | Code | 2
An OpenMind for 3D medical vision self-supervised learning | Code | 2
FaceScore: Benchmarking and Enhancing Face Quality in Human Generation | Code | 2
FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation | Code | 2
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models | Code | 2
FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Code | 2
GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning | Code | 2
HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance | Code | 2
DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Code | 2
Benchmarking Deep Reinforcement Learning for Continuous Control | Code | 2
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks | Code | 2
Datasets and Benchmarks for Offline Safe Reinforcement Learning | Code | 2
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond | Code | 2
GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization | Code | 2
GV-Bench: Benchmarking Local Feature Matching for Geometric Verification of Long-term Loop Closure Detection | Code | 2
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | Code | 2
Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Code | 2
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Code | 2
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2
IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer | Code | 2
AiTLAS: Artificial Intelligence Toolbox for Earth Observation | Code | 2
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Code | 2
InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models | Code | 2
CoqPilot, a plugin for LLM-based generation of proofs | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified