SOTAVerified

Benchmarking

Papers

Showing 251-275 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| FM-Planner: Foundation Model Guided Path Planning for Autonomous Drone Navigation | Code | 1 |
| SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | | 0 |
| Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models | Code | 1 |
| AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems | | 0 |
| AMQA: An Adversarial Dataset for Benchmarking Bias of LLMs in Medicine and Healthcare | Code | 0 |
| MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | Code | 1 |
| PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology | | 0 |
| TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs | | 0 |
| Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat | | 0 |
| Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages | | 0 |
| Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement | Code | 0 |
| A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking | | 0 |
| OB3D: A New Dataset for Benchmarking Omnidirectional 3D Reconstruction Using Blender | Code | 1 |
| Automated Text-to-Table for Reasoning-Intensive Table QA: Pipeline Design and Benchmarking Insights | Code | 0 |
| Synthetic Time Series Forecasting with Transformer Architectures: Extensive Simulation Benchmarks | Code | 0 |
| Transformers in Protein: A Survey | | 0 |
| Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs | Code | 0 |
| FinLoRA: Benchmarking LoRA Methods for Fine-Tuning LLMs on Financial Datasets | | 0 |
| EuroCon: Benchmarking Parliament Deliberation for Political Consensus Finding | | 0 |
| StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs | | 0 |
| AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science | | 0 |
| SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning | Code | 1 |
| DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research | | 0 |
| SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs | | 0 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |