Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 201–225 of 5548 papers

Title	Date	Tasks	Status	Hype
ODRL: A Benchmark for Off-Dynamics Reinforcement Learning	Oct 28, 2024	Benchmarkingreinforcement-learning	CodeCode Available	2
CoqPilot, a plugin for LLM-based generation of proofs	Oct 25, 2024	Benchmarking	CodeCode Available	2
Open6DOR: Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach	Oct 24, 2024	BenchmarkingInstruction Following	CodeCode Available	2
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style	Oct 21, 2024	BenchmarkingLanguage Modeling	CodeCode Available	2
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following	Oct 21, 2024	BenchmarkingInstruction Following	CodeCode Available	2
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation	Oct 19, 2024	AI AgentBenchmarking	CodeCode Available	2
IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning	Oct 19, 2024	BenchmarkingMulti-agent Reinforcement Learning	CodeCode Available	2
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models	Oct 13, 2024	BenchmarkingGraph Generation	CodeCode Available	2
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act	Oct 10, 2024	BenchmarkingFairness	CodeCode Available	2
Benchmarking Agentic Workflow Generation	Oct 10, 2024	Benchmarking	CodeCode Available	2
Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond	Oct 9, 2024	Benchmarking	CodeCode Available	2
FedGraph: A Research Library and Benchmark for Federated Graph Learning	Oct 8, 2024	BenchmarkingFederated Learning	CodeCode Available	2
MIBench: A Comprehensive Framework for Benchmarking Model Inversion Attack and Defense	Oct 7, 2024	Adversarial RobustnessBenchmarking	CodeCode Available	2
dattri: A Library for Efficient Data Attribution	Oct 6, 2024	Benchmarking	CodeCode Available	2
AutoPenBench: Benchmarking Generative Agents for Penetration Testing	Oct 4, 2024	Benchmarking	CodeCode Available	2
Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models	Sep 30, 2024	BenchmarkingContinual Learning	CodeCode Available	2
A Survey on Graph Neural Networks for Remaining Useful Life Prediction: Methodologies, Evaluation and Future Trends	Sep 29, 2024	Benchmarkinggraph construction	CodeCode Available	2
GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization	Sep 24, 2024	3D geometry3DGS	CodeCode Available	2
Small Language Models: Survey, Measurements, and Insights	Sep 24, 2024	BenchmarkingDecoder	CodeCode Available	2
A Survey on Multimodal Benchmarks: In the Era of Large AI Models	Sep 21, 2024	BenchmarkingSurvey	CodeCode Available	2
Advances in APPFL: A Comprehensive and Extensible Federated Learning Framework	Sep 17, 2024	BenchmarkingFederated Learning	CodeCode Available	2
Assessing SPARQL capabilities of Large Language Models	Sep 9, 2024	BenchmarkingKnowledge Graphs	CodeCode Available	2
PlantSeg: A Large-Scale In-the-wild Dataset for Plant Disease Segmentation	Sep 6, 2024	Benchmarkingimage-classification	CodeCode Available	2
Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions	Aug 28, 2024	Benchmarking	CodeCode Available	2
PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis	Aug 20, 2024	Benchmarking	CodeCode Available	2

Show:10 25 50

← PrevPage 9 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified