SOTAVerified

Benchmarking

Papers

Showing 201-250 of 5548 papers

Title | Status | Hype
ODRL: A Benchmark for Off-Dynamics Reinforcement Learning | Code | 2
CoqPilot, a plugin for LLM-based generation of proofs | Code | 2
Open6DOR: Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach | Code | 2
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style | Code | 2
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | Code | 2
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation | Code | 2
IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning | Code | 2
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Code | 2
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Code | 2
Benchmarking Agentic Workflow Generation | Code | 2
Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond | Code | 2
FedGraph: A Research Library and Benchmark for Federated Graph Learning | Code | 2
MIBench: A Comprehensive Framework for Benchmarking Model Inversion Attack and Defense | Code | 2
dattri: A Library for Efficient Data Attribution | Code | 2
AutoPenBench: Benchmarking Generative Agents for Penetration Testing | Code | 2
Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models | Code | 2
A Survey on Graph Neural Networks for Remaining Useful Life Prediction: Methodologies, Evaluation and Future Trends | Code | 2
GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization | Code | 2
Small Language Models: Survey, Measurements, and Insights | Code | 2
A Survey on Multimodal Benchmarks: In the Era of Large AI Models | Code | 2
Advances in APPFL: A Comprehensive and Extensible Federated Learning Framework | Code | 2
Assessing SPARQL capabilities of Large Language Models | Code | 2
PlantSeg: A Large-Scale In-the-wild Dataset for Plant Disease Segmentation | Code | 2
Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions | Code | 2
PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis | Code | 2
SustainDC: Benchmarking for Sustainable Data Center Control | Code | 2
COALA: A Practical and Vision-Centric Federated Learning Platform | Code | 2
MOMAland: A Set of Benchmarks for Multi-Objective Multi-Agent Reinforcement Learning | Code | 2
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Code | 2
GV-Bench: Benchmarking Local Feature Matching for Geometric Verification of Long-term Loop Closure Detection | Code | 2
WayveScenes101: A Dataset and Benchmark for Novel View Synthesis in Autonomous Driving | Code | 2
InstructLayout: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior | Code | 2
HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance | Code | 2
SH17: A Dataset for Human Safety and Personal Protective Equipment Detection in Manufacturing Industry | Code | 2
Benchmarking Complex Instruction-Following with Multiple Constraints Composition | Code | 2
Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Code | 2
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Code | 2
FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | Code | 2
Benchmarking Predictive Coding Networks -- Made Simple | Code | 2
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations | Code | 2
UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models | Code | 2
MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data | Code | 2
GenRL: Multimodal-foundation world models for generalization in embodied agents | Code | 2
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA | Code | 2
From Perfect to Noisy World Simulation: Customizable Embodied Multi-modal Perturbations for SLAM Robustness Benchmarking | Code | 2
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation | Code | 2
FaceScore: Benchmarking and Enhancing Face Quality in Human Generation | Code | 2
Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking | Code | 2
GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis | Code | 2
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified