Benchmarking

Papers

Showing 301–325 of 5548 papers

Title	Date	Tasks	Status	Hype	Score
Exponentially Faster Language Modelling	Nov 15, 2023	Benchmarking, CPU	Code Available	2	5
Extended Agriculture-Vision: An Extension of a Large Aerial Image Dataset for Agricultural Pattern Analysis	Mar 4, 2023	Benchmarking, Contrastive Learning	Code Available	2	5
An OpenMind for 3D medical vision self-supervised learning	Dec 22, 2024	Benchmarking, Self-Supervised Learning	Code Available	2	5
FaceScore: Benchmarking and Enhancing Face Quality in Human Generation	Jun 24, 2024	Benchmarking, Denoising	Code Available	2	5
FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation	Mar 4, 2023	Benchmarking, GPU	Code Available	2	5
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models	May 5, 2025	Benchmarking, Mathematical Reasoning	Code Available	2	5
FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis	Feb 20, 2025	Age Estimation, Benchmarking	Code Available	2	5
GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning	Jul 4, 2025	Benchmarking, Graph Generation	Code Available	2	5
HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance	Jul 9, 2024	Benchmarking, Conditional Image Generation	Code Available	2	5
DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation	Jun 22, 2022	Benchmarking, Recommendation Systems	Code Available	2	5
Benchmarking Deep Reinforcement Learning for Continuous Control	Apr 22, 2016	Action Triplet Recognition, Atari Games	Code Available	2	5
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks	Nov 28, 2024	Benchmarking, Object Counting	Code Available	2	5
Datasets and Benchmarks for Offline Safe Reinforcement Learning	Jun 15, 2023	Autonomous Driving, Benchmarking	Code Available	2	5
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond	Sep 28, 2023	Benchmarking	Code Available	2	5
GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization	Sep 24, 2024	3D geometry, 3DGS	Code Available	2	5
GV-Bench: Benchmarking Local Feature Matching for Geometric Verification of Long-term Loop Closure Detection	Jul 16, 2024	Benchmarking, Loop Closure Detection	Code Available	2	5
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions	May 24, 2025	Benchmarking	Code Available	2	5
Customizable Perturbation Synthesis for Robust SLAM Benchmarking	Feb 12, 2024	Benchmarking, Simultaneous Localization and Mapping	Code Available	2	5
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer	Mar 21, 2025	Benchmarking, Video Generation	Code Available	2	5
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation	Oct 30, 2024	Benchmarking, Passage Retrieval	Code Available	2	5
IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer	Jul 27, 2023	Benchmarking, Image Manipulation	Code Available	2	5
AiTLAS: Artificial Intelligence Toolbox for Earth Observation	Jan 21, 2022	Benchmarking, Earth Observation	Code Available	2	5
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents	Mar 5, 2024	Benchmarking, Language Modeling	Code Available	2	5
InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models	Oct 30, 2024	Benchmarking	Code Available	2	5
CoqPilot, a plugin for LLM-based generation of proofs	Oct 25, 2024	Benchmarking	Code Available	2	5

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified