Benchmarking

Papers


Showing 151–200 of 5548 papers

Title	Date	Tasks	Status	Hype
Immersive Neural Graphics Primitives	Nov 24, 2022	Benchmarking, NeRF	Code Available	2
IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer	Jul 27, 2023	Benchmarking, Image Manipulation	Code Available	2
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks	Jan 10, 2024	Benchmarking	Code Available	2
A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark	Jan 1, 2024	Age Estimation, Benchmarking	Code Available	2
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents	Mar 5, 2024	Benchmarking, Language Modeling	Code Available	2
HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance	Jul 9, 2024	Benchmarking, Conditional Image Generation	Code Available	2
HourVideo: 1-Hour Video-Language Understanding	Nov 7, 2024	Benchmarking, counterfactual	Code Available	2
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks	Oct 30, 2023	Benchmarking, object-detection	Code Available	2
A Dynamic Points Removal Benchmark in Point Cloud Maps	Jul 14, 2023	Benchmarking, Dynamic Point Removal	Code Available	2
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models	Apr 17, 2021	Argument Retrieval, Benchmarking	Code Available	2
HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation	Apr 15, 2025	Benchmarking, scientific discovery	Code Available	2
InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models	Oct 30, 2024	Benchmarking	Code Available	2
K-LITE: Learning Transferable Visual Models with External Knowledge	Apr 20, 2022	Benchmarking, Descriptive	Code Available	2
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs	Jun 13, 2024	Benchmarking, GPU	Code Available	2
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving	Dec 19, 2024	Autonomous Driving, Benchmarking	Code Available	2
AutoPenBench: Benchmarking Generative Agents for Penetration Testing	Oct 4, 2024	Benchmarking	Code Available	2
Habitat: A Platform for Embodied AI Research	Apr 2, 2019	Benchmarking, GPU	Code Available	2
GSCodec Studio: A Modular Framework for Gaussian Splat Compression	Jun 2, 2025	Benchmarking	Code Available	2
GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization	Sep 24, 2024	3D geometry, 3DGS	Code Available	2
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond	Sep 28, 2023	Benchmarking	Code Available	2
BARS: Towards Open Benchmarking for Recommender Systems	May 19, 2022	Benchmarking, Click-Through Rate Prediction	Code Available	2
Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark	Mar 10, 2025	Autonomous Driving, Benchmarking	Code Available	2
GV-Bench: Benchmarking Local Feature Matching for Geometric Verification of Long-term Loop Closure Detection	Jul 16, 2024	Benchmarking, Loop Closure Detection	Code Available	2
HLSFactory: A Framework Empowering High-Level Synthesis Datasets for Machine Learning and Beyond	May 1, 2024	Benchmarking, High-Level Synthesis	Code Available	2
GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis	Jun 21, 2024	AI Agent, AutoML	Code Available	2
Authorship Obfuscation in Multilingual Machine-Generated Text Detection	Jan 15, 2024	Adversarial Robustness, Benchmarking	Code Available	2
GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models	Jun 18, 2024	Benchmarking, Depth Estimation	Code Available	2
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks	Nov 28, 2024	Benchmarking, Object Counting	Code Available	2
From Perfect to Noisy World Simulation: Customizable Embodied Multi-modal Perturbations for SLAM Robustness Benchmarking	Jun 24, 2024	Benchmarking, NeRF	Code Available	2
Fortuna: A Library for Uncertainty Quantification in Deep Learning	Feb 8, 2023	Bayesian Inference, Benchmarking	Code Available	2
GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning	Jul 4, 2025	Benchmarking, Graph Generation	Code Available	2
A Survey on Graph Neural Networks for Remaining Useful Life Prediction: Methodologies, Evaluation and Future Trends	Sep 29, 2024	Benchmarking, graph construction	Code Available	2
A Survey on Multimodal Benchmarks: In the Era of Large AI Models	Sep 21, 2024	Benchmarking, Survey	Code Available	2
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance	Feb 12, 2025	Benchmarking, Long-Context Understanding	Code Available	2
GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification	May 18, 2025	Benchmarking	Code Available	2
HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting?	Jun 20, 2024	Benchmarking, Point Processes	Code Available	2
Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale	Apr 19, 2025	Benchmarking	Code Available	2
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation	Aug 17, 2022	Benchmarking, Code Generation	Code Available	2
Exponentially Faster Language Modelling	Nov 15, 2023	Benchmarking, CPU	Code Available	2
Extended Agriculture-Vision: An Extension of a Large Aerial Image Dataset for Agricultural Pattern Analysis	Mar 4, 2023	Benchmarking, Contrastive Learning	Code Available	2
Event-Based Motion Magnification	Feb 19, 2024	Benchmarking, Motion Detection	Code Available	2
FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models	Jul 1, 2024	Benchmarking, Fairness	Code Available	2
EV2Gym: A Flexible V2G Simulator for EV Smart Charging Research and Benchmarking	Apr 2, 2024	Benchmarking, Reinforcement Learning (RL)	Code Available	2
Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception	Jun 10, 2023	3D Object Detection, Benchmarking	Code Available	2
FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation	Mar 4, 2023	Benchmarking, GPU	Code Available	2
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models	May 5, 2025	Benchmarking, Mathematical Reasoning	Code Available	2
A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning	Sep 26, 2023	Benchmarking, Multi-Objective Reinforcement Learning	Code Available	2
Foundational Models Defining a New Era in Vision: A Survey and Outlook	Jul 25, 2023	Benchmarking	Code Available	2
EvalGIM: A Library for Evaluating Generative Image Models	Dec 13, 2024	Benchmarking, Diversity	Code Available	2
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing	Apr 3, 2025	Benchmarking, Logical Reasoning	Code Available	2



Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified