Benchmarking

Papers

Showing 351–375 of 5548 papers

Title	Date	Tasks	Status	Hype
Customizable Perturbation Synthesis for Robust SLAM Benchmarking	Feb 12, 2024	Benchmarking, Simultaneous Localization and Mapping	Code Available	2
DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation	Jun 22, 2022	Benchmarking, Recommendation Systems	Code Available	2
EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models	Dec 11, 2023	Benchmarking, Emotional Intelligence	Code Available	2
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation	Oct 30, 2024	Benchmarking, Passage Retrieval	Code Available	2
CoqPilot, a plugin for LLM-based generation of proofs	Oct 25, 2024	Benchmarking	Code Available	2
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following	Oct 21, 2024	Benchmarking, Instruction Following	Code Available	2
GenRL: Multimodal-foundation world models for generalization in embodied agents	Jun 26, 2024	Benchmarking, Reinforcement Learning (RL)	Code Available	2
Commit0: Library Generation from Scratch	Dec 2, 2024	Benchmarking, Code Generation	Code Available	2
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models	Jul 3, 2024	Benchmarking, Code Search	Code Available	2
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act	Oct 10, 2024	Benchmarking, Fairness	Code Available	2
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs	Jun 13, 2024	Benchmarking, Question Answering	Code Available	2
Neptune: The Long Orbit to Benchmarking Long Video Understanding	Dec 12, 2024	Benchmarking, Multimodal Reasoning	Code Available	2
Craftium: An Extensible Framework for Creating Reinforcement Learning Environments	Jul 4, 2024	Benchmarking, Minecraft	Code Available	2
ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling	Jul 4, 2023	Benchmarking, Weather Forecasting	Code Available	2
COALA: A Practical and Vision-Centric Federated Learning Platform	Jul 23, 2024	Benchmarking, Continual Learning	Code Available	2
Octopus: Embodied Vision-Language Programmer from Environmental Feedback	Oct 12, 2023	Benchmarking, Code Generation	Code Available	2
Are large language models superhuman chemists?	Apr 1, 2024	Benchmarking	Code Available	2
Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception	Jun 10, 2023	3D Object Detection, Benchmarking	Code Available	2
Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations	Jun 9, 2022	Benchmarking, continuous-control	Code Available	2
CausalGym: Benchmarking causal interpretability methods on linguistic tasks	Feb 19, 2024	Benchmarking, Interpretability Techniques for Deep Learning	Code Available	2
Building Normalizing Flows with Stochastic Interpolants	Sep 30, 2022	Benchmarking, Density Estimation	Code Available	2
OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception	Mar 7, 2023	Autonomous Driving, Benchmarking	Code Available	2
BTS: Building Timeseries Dataset: Empowering Large-Scale Building Analytics	Jun 13, 2024	Benchmarking	Code Available	2
Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework	Jun 23, 2020	Benchmarking, GPU	Code Available	2
A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning	Sep 26, 2023	Benchmarking, Multi-Objective Reinforcement Learning	Code Available	2

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified