SOTAVerified

Benchmarking

Papers

Showing 201-225 of 5548 papers

Title | Status | Hype
ODRL: A Benchmark for Off-Dynamics Reinforcement Learning | Code | 2
CoqPilot, a plugin for LLM-based generation of proofs | Code | 2
Open6DOR: Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach | Code | 2
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style | Code | 2
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | Code | 2
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation | Code | 2
IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning | Code | 2
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Code | 2
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Code | 2
Benchmarking Agentic Workflow Generation | Code | 2
Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond | Code | 2
FedGraph: A Research Library and Benchmark for Federated Graph Learning | Code | 2
MIBench: A Comprehensive Framework for Benchmarking Model Inversion Attack and Defense | Code | 2
dattri: A Library for Efficient Data Attribution | Code | 2
AutoPenBench: Benchmarking Generative Agents for Penetration Testing | Code | 2
Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models | Code | 2
A Survey on Graph Neural Networks for Remaining Useful Life Prediction: Methodologies, Evaluation and Future Trends | Code | 2
GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization | Code | 2
Small Language Models: Survey, Measurements, and Insights | Code | 2
A Survey on Multimodal Benchmarks: In the Era of Large AI Models | Code | 2
Advances in APPFL: A Comprehensive and Extensible Federated Learning Framework | Code | 2
Assessing SPARQL capabilities of Large Language Models | Code | 2
PlantSeg: A Large-Scale In-the-wild Dataset for Plant Disease Segmentation | Code | 2
Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions | Code | 2
PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified