SOTAVerified

Benchmarking

Papers

Showing 151-200 of 5548 papers

Title | Status | Hype
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents | Code | 2
VERINA: Benchmarking Verifiable Code Generation | Code | 2
LLaMEA-BO: A Large Language Model Evolutionary Algorithm for Automatically Generating Bayesian Optimization Algorithms | Code | 2
Benchmarking Laparoscopic Surgical Image Restoration and Beyond | Code | 2
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | Code | 2
GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification | Code | 2
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly | Code | 2
Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement | Code | 2
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models | Code | 2
MINERVA: Evaluating Complex Video Reasoning | Code | 2
Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook | Code | 2
BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese | Code | 2
WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks | Code | 2
Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale | Code | 2
HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation | Code | 2
TorchFX: A modern approach to Audio DSP with PyTorch and GPU acceleration | Code | 2
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Code | 2
Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework | Code | 2
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Code | 2
VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning | Code | 2
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning | Code | 2
Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark | Code | 2
Medical Hallucinations in Foundation Models and Their Impact on Healthcare | Code | 2
Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts | Code | 2
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators | Code | 2
FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Code | 2
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance | Code | 2
SoK: Benchmarking Poisoning Attacks and Defenses in Federated Learning | Code | 2
Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation | Code | 2
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model | Code | 2
Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video | Code | 2
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | Code | 2
nnWNet: Rethinking the Use of Transformers in Biomedical Image Segmentation and Calling for a Unified Evaluation Benchmark | Code | 2
An OpenMind for 3D medical vision self-supervised learning | Code | 2
XRAG: eXamining the Core -- Benchmarking Foundational Components in Advanced Retrieval-Augmented Generation | Code | 2
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | Code | 2
Open Universal Arabic ASR Leaderboard | Code | 2
NeuralPLexer3: Accurate Biomolecular Complex Structure Prediction with Flow Models | Code | 2
EvalGIM: A Library for Evaluating Generative Image Models | Code | 2
Neptune: The Long Orbit to Benchmarking Long Video Understanding | Code | 2
Video Quality Assessment: A Comprehensive Survey | Code | 2
Commit0: Library Generation from Scratch | Code | 2
OpenQDC: Open Quantum Data Commons | Code | 2
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks | Code | 2
HourVideo: 1-Hour Video-Language Understanding | Code | 2
Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping | Code | 2
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators | Code | 2
InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models | Code | 2
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2
PC-Gym: Benchmark Environments For Process Control Problems | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified