SOTAVerified

Benchmarking

Papers

Showing 426-450 of 5548 papers

Title | Status | Hype
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments | Code | 1
Bencher: Simple and Reproducible Benchmarking for Black-Box Optimization | Code | 1
FM-Planner: Foundation Model Guided Path Planning for Autonomous Drone Navigation | Code | 1
Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models | Code | 1
OB3D: A New Dataset for Benchmarking Omnidirectional 3D Reconstruction Using Blender | Code | 1
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | Code | 1
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | Code | 1
SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning | Code | 1
FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow | Code | 1
Semantic Correspondence: Unified Benchmarking and a Strong Baseline | Code | 1
Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions | Code | 1
Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph | Code | 1
REOBench: Benchmarking Robustness of Earth Observation Foundation Models | Code | 1
EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios | Code | 1
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | Code | 1
Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering | Code | 1
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models | Code | 1
TxPert: Leveraging Biochemical Relationships for Out-of-Distribution Transcriptomic Perturbation Prediction | Code | 1
Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities | Code | 1
TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents | Code | 1
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | Code | 1
MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks | Code | 1
What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization | Code | 1
LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation | Code | 1
Massive-STEPS: Massive Semantic Trajectories for Understanding POI Check-ins -- Dataset and Benchmarks | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified