SOTAVerified

Benchmarking

Papers

Showing 351–375 of 5548 papers

Title | Status | Hype
Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Code | 2
DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Code | 2
EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models | Code | 2
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2
CoqPilot, a plugin for LLM-based generation of proofs | Code | 2
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | Code | 2
GenRL: Multimodal-foundation world models for generalization in embodied agents | Code | 2
Commit0: Library Generation from Scratch | Code | 2
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Code | 2
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Code | 2
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Code | 2
Neptune: The Long Orbit to Benchmarking Long Video Understanding | Code | 2
Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Code | 2
ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling | Code | 2
COALA: A Practical and Vision-Centric Federated Learning Platform | Code | 2
Octopus: Embodied Vision-Language Programmer from Environmental Feedback | Code | 2
Are large language models superhuman chemists? | Code | 2
Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception | Code | 2
Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations | Code | 2
CausalGym: Benchmarking causal interpretability methods on linguistic tasks | Code | 2
Building Normalizing Flows with Stochastic Interpolants | Code | 2
OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception | Code | 2
BTS: Building Timeseries Dataset: Empowering Large-Scale Building Analytics | Code | 2
Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework | Code | 2
A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified