Benchmarking

Papers

Showing 351–375 of 5548 papers

Title	Date	Tasks	Status	Hype	Score
Customizable Perturbation Synthesis for Robust SLAM Benchmarking	Feb 12, 2024	Benchmarking, Simultaneous Localization and Mapping	Code Available	2	5
A Survey on Graph Neural Networks for Remaining Useful Life Prediction: Methodologies, Evaluation and Future Trends	Sep 29, 2024	Benchmarking, graph construction	Code Available	2	5
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions	May 24, 2025	Benchmarking	Code Available	2	5
DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation	Jun 22, 2022	Benchmarking, Recommendation Systems	Code Available	2	5
Craftium: An Extensible Framework for Creating Reinforcement Learning Environments	Jul 4, 2024	Benchmarking, Minecraft	Code Available	2	5
Neptune: The Long Orbit to Benchmarking Long Video Understanding	Dec 12, 2024	Benchmarking, Multimodal Reasoning	Code Available	2	5
Datasets and Benchmarks for Offline Safe Reinforcement Learning	Jun 15, 2023	Autonomous Driving, Benchmarking	Code Available	2	5
EvalGIM: A Library for Evaluating Generative Image Models	Dec 13, 2024	Benchmarking, Diversity	Code Available	2	5
Commit0: Library Generation from Scratch	Dec 2, 2024	Benchmarking, Code Generation	Code Available	2	5
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models	Jul 3, 2024	Benchmarking, Code Search	Code Available	2	5
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act	Oct 10, 2024	Benchmarking, Fairness	Code Available	2	5
COALA: A Practical and Vision-Centric Federated Learning Platform	Jul 23, 2024	Benchmarking, Continual Learning	Code Available	2	5
Benchmarking Potential Based Rewards for Learning Humanoid Locomotion	Jul 19, 2023	Benchmarking, Reinforcement Learning (RL)	Code Available	2	5
Benchmarking Complex Instruction-Following with Multiple Constraints Composition	Jul 4, 2024	Benchmarking, Instruction Following	Code Available	2	5
CoqPilot, a plugin for LLM-based generation of proofs	Oct 25, 2024	Benchmarking	Code Available	2	5
Benchmarking Benchmark Leakage in Large Language Models	Apr 29, 2024	Benchmarking, Mathematical Reasoning	Code Available	2	5
Class-incremental Learning for Time Series: Benchmark and Evaluation	Feb 19, 2024	Activity Recognition, Benchmarking	Code Available	2	5
OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception	Mar 7, 2023	Autonomous Driving, Benchmarking	Code Available	2	5
ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling	Jul 4, 2023	Benchmarking, Weather Forecasting	Code Available	2	5
A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning	Sep 26, 2023	Benchmarking, Multi-Objective Reinforcement Learning	Code Available	2	5
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation	Oct 30, 2024	Benchmarking, Passage Retrieval	Code Available	2	5
CausalGym: Benchmarking causal interpretability methods on linguistic tasks	Feb 19, 2024	Benchmarking, Interpretability Techniques for Deep Learning	Code Available	2	5
Benchmarking and Improving Detail Image Caption	May 29, 2024	Benchmarking, Image Captioning	Code Available	2	5
Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations	Jun 9, 2022	Benchmarking, continuous-control	Code Available	2	5
Building Normalizing Flows with Stochastic Interpolants	Sep 30, 2022	Benchmarking, Density Estimation	Code Available	2	5


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified