SOTAVerified

Benchmarking

Papers

Showing 351-400 of 5548 papers

Title | Status | Hype
EffiBench: Benchmarking the Efficiency of Automatically Generated Code | Code | 2
GenRL: Multimodal-foundation world models for generalization in embodied agents | Code | 2
Multitask Prompted Training Enables Zero-Shot Task Generalization | Code | 2
EvalGIM: A Library for Evaluating Generative Image Models | Code | 2
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation | Code | 2
Benchmarking Large Language Models in Retrieval-Augmented Generation | Code | 2
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering | Code | 2
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation | Code | 2
Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint) | Code | 2
A Survey on Graph Neural Networks for Remaining Useful Life Prediction: Methodologies, Evaluation and Future Trends | Code | 2
A Survey on Multimodal Benchmarks: In the Era of Large AI Models | Code | 2
A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning | Code | 2
Deep Visual Geo-localization Benchmark | Code | 2
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Code | 2
Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details | Code | 2
DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Code | 2
Assessing SPARQL capabilities of Large Language Models | Code | 2
Datasets and Benchmarks for Offline Safe Reinforcement Learning | Code | 2
Open Universal Arabic ASR Leaderboard | Code | 2
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | Code | 2
An OpenMind for 3D medical vision self-supervised learning | Code | 2
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation | Code | 2
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence | Code | 2
Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Code | 2
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Code | 2
Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception | Code | 2
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Code | 2
Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Code | 2
CoqPilot, a plugin for LLM-based generation of proofs | Code | 2
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations | Code | 2
PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models | Code | 2
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Code | 2
Are large language models superhuman chemists? | Code | 2
ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling | Code | 2
COALA: A Practical and Vision-Centric Federated Learning Platform | Code | 2
Commit0: Library Generation from Scratch | Code | 2
Class-incremental Learning for Time Series: Benchmark and Evaluation | Code | 2
Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations | Code | 2
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Code | 2
Event-Based Motion Magnification | Code | 2
Benchmarking Robustness of 3D Point Cloud Recognition Against Common Corruptions | Code | 2
Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark | Code | 2
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | Code | 2
iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval | Code | 2
Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond | Code | 2
COSMOS: Catching Out-of-Context Misinformation with Self-Supervised Learning | Code | 1
Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels | Code | 1
RADAR: Benchmarking Language Models on Imperfect Tabular Data | Code | 1
APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and Beyond | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified