SOTAVerified

Benchmarking

Papers

Showing 226-250 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| SustainDC: Benchmarking for Sustainable Data Center Control | Code | 2 |
| COALA: A Practical and Vision-Centric Federated Learning Platform | Code | 2 |
| MOMAland: A Set of Benchmarks for Multi-Objective Multi-Agent Reinforcement Learning | Code | 2 |
| Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Code | 2 |
| GV-Bench: Benchmarking Local Feature Matching for Geometric Verification of Long-term Loop Closure Detection | Code | 2 |
| WayveScenes101: A Dataset and Benchmark for Novel View Synthesis in Autonomous Driving | Code | 2 |
| InstructLayout: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior | Code | 2 |
| HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance | Code | 2 |
| SH17: A Dataset for Human Safety and Personal Protective Equipment Detection in Manufacturing Industry | Code | 2 |
| Benchmarking Complex Instruction-Following with Multiple Constraints Composition | Code | 2 |
| Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Code | 2 |
| CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Code | 2 |
| FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | Code | 2 |
| Benchmarking Predictive Coding Networks -- Made Simple | Code | 2 |
| MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations | Code | 2 |
| UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models | Code | 2 |
| MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data | Code | 2 |
| GenRL: Multimodal-foundation world models for generalization in embodied agents | Code | 2 |
| Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA | Code | 2 |
| From Perfect to Noisy World Simulation: Customizable Embodied Multi-modal Perturbations for SLAM Robustness Benchmarking | Code | 2 |
| DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation | Code | 2 |
| FaceScore: Benchmarking and Enhancing Face Quality in Human Generation | Code | 2 |
| Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking | Code | 2 |
| GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis | Code | 2 |
| Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph | Code | 2 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |