Benchmarking

Papers

Showing 201–250 of 5548 papers

Title	Date	Tasks	Status	Hype
ODRL: A Benchmark for Off-Dynamics Reinforcement Learning	Oct 28, 2024	Benchmarking, reinforcement-learning	Code Available	2
CoqPilot, a plugin for LLM-based generation of proofs	Oct 25, 2024	Benchmarking	Code Available	2
Open6DOR: Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach	Oct 24, 2024	Benchmarking, Instruction Following	Code Available	2
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style	Oct 21, 2024	Benchmarking, Language Modeling	Code Available	2
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following	Oct 21, 2024	Benchmarking, Instruction Following	Code Available	2
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation	Oct 19, 2024	AI Agent, Benchmarking	Code Available	2
IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning	Oct 19, 2024	Benchmarking, Multi-agent Reinforcement Learning	Code Available	2
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models	Oct 13, 2024	Benchmarking, Graph Generation	Code Available	2
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act	Oct 10, 2024	Benchmarking, Fairness	Code Available	2
Benchmarking Agentic Workflow Generation	Oct 10, 2024	Benchmarking	Code Available	2
Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond	Oct 9, 2024	Benchmarking	Code Available	2
FedGraph: A Research Library and Benchmark for Federated Graph Learning	Oct 8, 2024	Benchmarking, Federated Learning	Code Available	2
MIBench: A Comprehensive Framework for Benchmarking Model Inversion Attack and Defense	Oct 7, 2024	Adversarial Robustness, Benchmarking	Code Available	2
dattri: A Library for Efficient Data Attribution	Oct 6, 2024	Benchmarking	Code Available	2
AutoPenBench: Benchmarking Generative Agents for Penetration Testing	Oct 4, 2024	Benchmarking	Code Available	2
Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models	Sep 30, 2024	Benchmarking, Continual Learning	Code Available	2
A Survey on Graph Neural Networks for Remaining Useful Life Prediction: Methodologies, Evaluation and Future Trends	Sep 29, 2024	Benchmarking, graph construction	Code Available	2
GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization	Sep 24, 2024	3D geometry, 3DGS	Code Available	2
Small Language Models: Survey, Measurements, and Insights	Sep 24, 2024	Benchmarking, Decoder	Code Available	2
A Survey on Multimodal Benchmarks: In the Era of Large AI Models	Sep 21, 2024	Benchmarking, Survey	Code Available	2
Advances in APPFL: A Comprehensive and Extensible Federated Learning Framework	Sep 17, 2024	Benchmarking, Federated Learning	Code Available	2
Assessing SPARQL capabilities of Large Language Models	Sep 9, 2024	Benchmarking, Knowledge Graphs	Code Available	2
PlantSeg: A Large-Scale In-the-wild Dataset for Plant Disease Segmentation	Sep 6, 2024	Benchmarking, image-classification	Code Available	2
Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions	Aug 28, 2024	Benchmarking	Code Available	2
PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis	Aug 20, 2024	Benchmarking	Code Available	2
SustainDC: Benchmarking for Sustainable Data Center Control	Aug 14, 2024	Benchmarking, Management	Code Available	2
COALA: A Practical and Vision-Centric Federated Learning Platform	Jul 23, 2024	Benchmarking, Continual Learning	Code Available	2
MOMAland: A Set of Benchmarks for Multi-Objective Multi-Agent Reinforcement Learning	Jul 23, 2024	Benchmarking, Decision Making	Code Available	2
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models	Jul 17, 2024	Benchmarking, Red Teaming	Code Available	2
GV-Bench: Benchmarking Local Feature Matching for Geometric Verification of Long-term Loop Closure Detection	Jul 16, 2024	Benchmarking, Loop Closure Detection	Code Available	2
WayveScenes101: A Dataset and Benchmark for Novel View Synthesis in Autonomous Driving	Jul 11, 2024	Autonomous Driving, Benchmarking	Code Available	2
InstructLayout: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior	Jul 10, 2024	Benchmarking, Decoder	Code Available	2
HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance	Jul 9, 2024	Benchmarking, Conditional Image Generation	Code Available	2
SH17: A Dataset for Human Safety and Personal Protective Equipment Detection in Manufacturing Industry	Jul 5, 2024	Benchmarking, object-detection	Code Available	2
Benchmarking Complex Instruction-Following with Multiple Constraints Composition	Jul 4, 2024	Benchmarking, Instruction Following	Code Available	2
Craftium: An Extensible Framework for Creating Reinforcement Learning Environments	Jul 4, 2024	Benchmarking, Minecraft	Code Available	2
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models	Jul 3, 2024	Benchmarking, Code Search	Code Available	2
FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models	Jul 1, 2024	Benchmarking, Fairness	Code Available	2
Benchmarking Predictive Coding Networks -- Made Simple	Jul 1, 2024	Benchmarking	Code Available	2
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations	Jul 1, 2024	Benchmarking, document understanding	Code Available	2
UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models	Jun 27, 2024	Attribute, Benchmarking	Code Available	2
MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data	Jun 26, 2024	Benchmarking, Math	Code Available	2
GenRL: Multimodal-foundation world models for generalization in embodied agents	Jun 26, 2024	Benchmarking, Reinforcement Learning (RL)	Code Available	2
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA	Jun 25, 2024	Benchmarking, Long-Context Understanding	Code Available	2
From Perfect to Noisy World Simulation: Customizable Embodied Multi-modal Perturbations for SLAM Robustness Benchmarking	Jun 24, 2024	Benchmarking, NeRF	Code Available	2
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation	Jun 24, 2024	Benchmarking, Image Generation	Code Available	2
FaceScore: Benchmarking and Enhancing Face Quality in Human Generation	Jun 24, 2024	Benchmarking, Denoising	Code Available	2
Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking	Jun 23, 2024	Benchmarking	Code Available	2
GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis	Jun 21, 2024	AI Agent, AutoML	Code Available	2
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph	Jun 21, 2024	Benchmarking, Text Generation	Code Available	2

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified