SOTAVerified

Benchmarking

Papers

Showing 351–400 of 5548 papers

Title | Status | Hype
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Code | 2
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering | Code | 2
State-specific protein-ligand complex structure prediction with a multi-scale deep generative model | Code | 2
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Code | 2
A Survey on Multimodal Benchmarks: In the Era of Large AI Models | Code | 2
FedGraph: A Research Library and Benchmark for Federated Graph Learning | Code | 2
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Code | 2
DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Code | 2
Benchmarking Deep Reinforcement Learning for Continuous Control | Code | 2
Datasets and Benchmarks for Offline Safe Reinforcement Learning | Code | 2
Deep Visual Geo-localization Benchmark | Code | 2
Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Code | 2
A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning | Code | 2
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | Code | 2
Benchmarking Robustness of 3D Point Cloud Recognition Against Common Corruptions | Code | 2
OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs | Code | 2
Open Universal Arabic ASR Leaderboard | Code | 2
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2
CoqPilot, a plugin for LLM-based generation of proofs | Code | 2
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | Code | 2
Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Code | 2
Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint) | Code | 2
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Code | 2
COALA: A Practical and Vision-Centric Federated Learning Platform | Code | 2
Authorship Obfuscation in Multilingual Machine-Generated Text Detection | Code | 2
ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling | Code | 2
Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework | Code | 2
PINNacle: A Comprehensive Benchmark of Physics-Informed Neural Networks for Solving PDEs | Code | 2
PocketVina Enables Scalable and Highly Accurate Physically Valid Docking through Multi-Pocket Conditioning | Code | 2
PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models | Code | 2
Commit0: Library Generation from Scratch | Code | 2
ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons | Code | 2
Benchmarking Complex Instruction-Following with Multiple Constraints Composition | Code | 2
Class-incremental Learning for Time Series: Benchmark and Evaluation | Code | 2
Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations | Code | 2
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Code | 2
Benchmarking the Robustness of LiDAR Semantic Segmentation Models | Code | 2
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Code | 2
Revealing data leakage in protein interaction benchmarks | Code | 2
FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Code | 2
Learning to Fly -- a Gym Environment with PyBullet Physics for Reinforcement Learning of Multi-agent Quadcopter Control | Code | 2
RoboPianist: Dexterous Piano Playing with Deep Reinforcement Learning | Code | 2
REAL-Colon: A dataset for developing real-world AI applications in colonoscopy | Code | 2
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph | Code | 2
BARS: Towards Open Benchmarking for Recommender Systems | Code | 2
Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach | Code | 2
COSMOS: Catching Out-of-Context Misinformation with Self-Supervised Learning | Code | 1
Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels | Code | 1
RADAR: Benchmarking Language Models on Imperfect Tabular Data | Code | 1
Benchmarking Bias Mitigation Algorithms in Representation Learning through Fairness Metrics | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified