Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 201–225 of 5548 papers

Title	Date	Tasks	Status	Hype
AutoPenBench: Benchmarking Generative Agents for Penetration Testing	Oct 4, 2024	Benchmarking	CodeCode Available	2
GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis	Jun 21, 2024	AI AgentAutoML	CodeCode Available	2
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs	Jun 13, 2024	BenchmarkingGPU	CodeCode Available	2
BARS: Towards Open Benchmarking for Recommender Systems	May 19, 2022	BenchmarkingClick-Through Rate Prediction	CodeCode Available	2
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond	Sep 28, 2023	Benchmarking	CodeCode Available	2
Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark	Mar 10, 2025	Autonomous DrivingBenchmarking	CodeCode Available	2
Assessing SPARQL capabilities of Large Language Models	Sep 9, 2024	BenchmarkingKnowledge Graphs	CodeCode Available	2
A Survey on Multimodal Benchmarks: In the Era of Large AI Models	Sep 21, 2024	BenchmarkingSurvey	CodeCode Available	2
ADATIME: A Benchmarking Suite for Domain Adaptation on Time Series Data	Mar 15, 2022	BenchmarkingDomain Adaptation	CodeCode Available	2
HLSFactory: A Framework Empowering High-Level Synthesis Datasets for Machine Learning and Beyond	May 1, 2024	BenchmarkingHigh-Level Synthesis	CodeCode Available	2
HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance	Jul 9, 2024	BenchmarkingConditional Image Generation	CodeCode Available	2
Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details	Feb 1, 2021	Benchmarkingobject-detection	CodeCode Available	2
FedGraph: A Research Library and Benchmark for Federated Graph Learning	Oct 8, 2024	BenchmarkingFederated Learning	CodeCode Available	2
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing	Apr 3, 2025	BenchmarkingLogical Reasoning	CodeCode Available	2
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation	Aug 17, 2022	BenchmarkingCode Generation	CodeCode Available	2
EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models	Dec 11, 2023	BenchmarkingEmotional Intelligence	CodeCode Available	2
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents	Mar 5, 2024	BenchmarkingLanguage Modeling	CodeCode Available	2
InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models	Oct 30, 2024	Benchmarking	CodeCode Available	2
Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping	Nov 5, 2024	BenchmarkingCode Generation	CodeCode Available	2
Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions	Aug 28, 2024	Benchmarking	CodeCode Available	2
IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning	Oct 19, 2024	BenchmarkingMulti-agent Reinforcement Learning	CodeCode Available	2
IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments	Jun 11, 2025	Benchmarking	CodeCode Available	2
A Dynamic Points Removal Benchmark in Point Cloud Maps	Jul 14, 2023	BenchmarkingDynamic Point Removal	CodeCode Available	2
Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception	Jun 10, 2023	3D Object DetectionBenchmarking	CodeCode Available	2
EV2Gym: A Flexible V2G Simulator for EV Smart Charging Research and Benchmarking	Apr 2, 2024	BenchmarkingReinforcement Learning (RL)	CodeCode Available	2

Show:10 25 50

← PrevPage 9 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified