Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1–50 of 5548 papers

Title	Date	Tasks	Status	Hype
Visual Place Recognition for Large-Scale UAV Applications	Jul 20, 2025	BenchmarkingDiversity	—Unverified	0
MUPAX: Multidimensional Problem Agnostic eXplainable AI	Jul 17, 2025	Anatomical Landmark DetectionAudio Classification	—Unverified	0
Training Transformers with Enforced Lipschitz Constants	Jul 17, 2025	Benchmarking	—Unverified	0
Disentangling coincident cell events using deep transfer learning and compressive sensing	Jul 17, 2025	BenchmarkingCompressive Sensing	—Unverified	0
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition	Jul 16, 2025	BenchmarkingKnowledge Distillation	CodeCode Available	0
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering	Jul 15, 2025	BenchmarkingInstruction Following	CodeCode Available	2
FLsim: A Modular and Library-Agnostic Simulation Framework for Federated Learning	Jul 15, 2025	BenchmarkingFederated Learning	CodeCode Available	0
A Multi-View High-Resolution Foot-Ankle Complex Point Cloud Dataset During Gait for Occlusion-Robust 3D Completion	Jul 15, 2025	BenchmarkingPoint Cloud Completion	—Unverified	0
DCR: Quantifying Data Contamination in LLMs Evaluation	Jul 15, 2025	Arithmetic ReasoningBenchmarking	CodeCode Available	0
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance	Jul 14, 2025	BenchmarkingCode Generation	—Unverified	0
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks	Jul 14, 2025	BenchmarkingCode Generation	—Unverified	0
Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop	Jul 14, 2025	Benchmarking	—Unverified	0
MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking	Jul 14, 2025	BenchmarkingLanguage Modeling	—Unverified	0
Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models	Jul 13, 2025	AttributeBenchmarking	CodeCode Available	0
Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift	Jul 12, 2025	BenchmarkingTransfer Learning	—Unverified	0
Identifying the Smallest Adversarial Load Perturbations that Render DC-OPF Infeasible	Jul 10, 2025	Adversarial AttackBenchmarking	CodeCode Available	0
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning	Jul 9, 2025	BenchmarkingImage Retrieval	CodeCode Available	0
Benchmarking Waitlist Mortality Prediction in Heart Transplantation Through Time-to-Event Modeling using New Longitudinal UNOS Dataset	Jul 9, 2025	BenchmarkingDecision Making	—Unverified	0
A Systematic Analysis of Hybrid Linear Attention	Jul 8, 2025	BenchmarkingLanguage Modeling	—Unverified	0
Hyperspectral Anomaly Detection Methods: A Survey and Comparative Study	Jul 8, 2025	Anomaly DetectionBenchmarking	—Unverified	0
SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations	Jul 8, 2025	6D Pose Estimation6D Pose Estimation using RGB	CodeCode Available	0
SQLBarber: A System Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads	Jul 8, 2025	Benchmarking	—Unverified	0
Inaugural MOASEI Competition at AAMAS'2025: A Technical Report	Jul 7, 2025	BenchmarkingDecision Making	—Unverified	0
LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models	Jul 5, 2025	BenchmarkingGPU	CodeCode Available	1
GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning	Jul 4, 2025	BenchmarkingGraph Generation	CodeCode Available	2
STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking	Jul 4, 2025	BenchmarkingNavigate	CodeCode Available	0
LANTERN: A Machine Learning Framework for Lipid Nanoparticle Transfection Efficiency Prediction	Jul 3, 2025	Benchmarking	CodeCode Available	0
Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data	Jul 3, 2025	BenchmarkingRepresentation Learning	CodeCode Available	1
CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks	Jul 3, 2025	BenchmarkingCode Generation	—Unverified	0
TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation	Jul 1, 2025	BenchmarkingMachine Translation	—Unverified	0
State and Memory is All You Need for Robust and Reliable AI Agents	Jun 30, 2025	AllBenchmarking	—Unverified	0
Point Cloud Compression and Objective Quality Assessment: A Survey	Jun 28, 2025	Autonomous DrivingBenchmarking	—Unverified	0
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge	Jun 26, 2025	Benchmarking	—Unverified	0
mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale	Jun 26, 2025	Anomaly DetectionBenchmarking	CodeCode Available	0
FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation	Jun 26, 2025	AttributeBenchmarking	—Unverified	0
Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation	Jun 26, 2025	BenchmarkingTransfer Learning	CodeCode Available	0
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions	Jun 26, 2025	BenchmarkingDrug Design	CodeCode Available	1
scMamba: A Scalable Foundation Model for Single-Cell Multi-Omics Integration Beyond Highly Variable Feature Selection	Jun 25, 2025	BenchmarkingContrastive Learning	—Unverified	0
MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans	Jun 25, 2025	Action DetectionBenchmarking	—Unverified	0
FixCLR: Negative-Class Contrastive Learning for Semi-Supervised Domain Generalization	Jun 25, 2025	BenchmarkingContrastive Learning	—Unverified	0
AI-Driven MRI-based Brain Tumour Segmentation Benchmarking	Jun 25, 2025	BenchmarkingImage Segmentation	—Unverified	0
inMOTIFin: a lightweight end-to-end simulation software for regulatory sequences	Jun 25, 2025	Benchmarking	CodeCode Available	0
HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction	Jun 25, 2025	BenchmarkingPerson Identification	CodeCode Available	0
Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision	Jun 25, 2025	BenchmarkingInformation Retrieval	—Unverified	0
Benchmarking Unsupervised Strategies for Anomaly Detection in Multivariate Time Series	Jun 25, 2025	Anomaly DetectionBenchmarking	CodeCode Available	0
A Survey of Predictive Maintenance Methods: An Analysis of Prognostics via Classification and Regression	Jun 25, 2025	BenchmarkingManagement	—Unverified	0
BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos	Jun 25, 2025	Artifact DetectionBenchmarking	—Unverified	0
WattsOnAI: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads	Jun 25, 2025	Benchmarking	CodeCode Available	1
Quantitative Benchmarking of Anomaly Detection Methods in Digital Pathology	Jun 24, 2025	Anomaly DetectionArtifact Detection	—Unverified	0
MDR-DeePC: Model-Inspired Distributionally Robust Data-Enabled Predictive Control	Jun 24, 2025	Benchmarking	—Unverified	0

Show:10 25 50

← PrevPage 1 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified