Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1501–1550 of 5548 papers

Title	Date	Tasks	Status
Disentangling coincident cell events using deep transfer learning and compressive sensing	Jul 17, 2025	BenchmarkingCompressive Sensing	—Unverified
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition	Jul 16, 2025	BenchmarkingKnowledge Distillation	CodeCode Available
A Multi-View High-Resolution Foot-Ankle Complex Point Cloud Dataset During Gait for Occlusion-Robust 3D Completion	Jul 15, 2025	BenchmarkingPoint Cloud Completion	—Unverified
DCR: Quantifying Data Contamination in LLMs Evaluation	Jul 15, 2025	Arithmetic ReasoningBenchmarking	CodeCode Available
FLsim: A Modular and Library-Agnostic Simulation Framework for Federated Learning	Jul 15, 2025	BenchmarkingFederated Learning	CodeCode Available
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks	Jul 14, 2025	BenchmarkingCode Generation	—Unverified
Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop	Jul 14, 2025	Benchmarking	—Unverified
MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking	Jul 14, 2025	BenchmarkingLanguage Modeling	—Unverified
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance	Jul 14, 2025	BenchmarkingCode Generation	—Unverified
Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models	Jul 13, 2025	AttributeBenchmarking	CodeCode Available
Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift	Jul 12, 2025	BenchmarkingTransfer Learning	—Unverified
Identifying the Smallest Adversarial Load Perturbations that Render DC-OPF Infeasible	Jul 10, 2025	Adversarial AttackBenchmarking	CodeCode Available
Benchmarking Waitlist Mortality Prediction in Heart Transplantation Through Time-to-Event Modeling using New Longitudinal UNOS Dataset	Jul 9, 2025	BenchmarkingDecision Making	—Unverified
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning	Jul 9, 2025	BenchmarkingImage Retrieval	CodeCode Available
SQLBarber: A System Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads	Jul 8, 2025	Benchmarking	—Unverified
Hyperspectral Anomaly Detection Methods: A Survey and Comparative Study	Jul 8, 2025	Anomaly DetectionBenchmarking	—Unverified
A Systematic Analysis of Hybrid Linear Attention	Jul 8, 2025	BenchmarkingLanguage Modeling	—Unverified
SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations	Jul 8, 2025	6D Pose Estimation6D Pose Estimation using RGB	CodeCode Available
Inaugural MOASEI Competition at AAMAS'2025: A Technical Report	Jul 7, 2025	BenchmarkingDecision Making	—Unverified
STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking	Jul 4, 2025	BenchmarkingNavigate	CodeCode Available
LANTERN: A Machine Learning Framework for Lipid Nanoparticle Transfection Efficiency Prediction	Jul 3, 2025	Benchmarking	CodeCode Available
CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks	Jul 3, 2025	BenchmarkingCode Generation	—Unverified
TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation	Jul 1, 2025	BenchmarkingMachine Translation	—Unverified
State and Memory is All You Need for Robust and Reliable AI Agents	Jun 30, 2025	AllBenchmarking	—Unverified
Point Cloud Compression and Objective Quality Assessment: A Survey	Jun 28, 2025	Autonomous DrivingBenchmarking	—Unverified
Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation	Jun 26, 2025	BenchmarkingTransfer Learning	CodeCode Available
mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale	Jun 26, 2025	Anomaly DetectionBenchmarking	CodeCode Available
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge	Jun 26, 2025	Benchmarking	—Unverified
FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation	Jun 26, 2025	AttributeBenchmarking	—Unverified
FixCLR: Negative-Class Contrastive Learning for Semi-Supervised Domain Generalization	Jun 25, 2025	BenchmarkingContrastive Learning	—Unverified
Benchmarking Unsupervised Strategies for Anomaly Detection in Multivariate Time Series	Jun 25, 2025	Anomaly DetectionBenchmarking	CodeCode Available
Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision	Jun 25, 2025	BenchmarkingInformation Retrieval	—Unverified
scMamba: A Scalable Foundation Model for Single-Cell Multi-Omics Integration Beyond Highly Variable Feature Selection	Jun 25, 2025	BenchmarkingContrastive Learning	—Unverified
A Survey of Predictive Maintenance Methods: An Analysis of Prognostics via Classification and Regression	Jun 25, 2025	BenchmarkingManagement	—Unverified
HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction	Jun 25, 2025	BenchmarkingPerson Identification	CodeCode Available
AI-Driven MRI-based Brain Tumour Segmentation Benchmarking	Jun 25, 2025	BenchmarkingImage Segmentation	—Unverified
BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos	Jun 25, 2025	Artifact DetectionBenchmarking	—Unverified
inMOTIFin: a lightweight end-to-end simulation software for regulatory sequences	Jun 25, 2025	Benchmarking	CodeCode Available
MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans	Jun 25, 2025	Action DetectionBenchmarking	—Unverified
Quantitative Benchmarking of Anomaly Detection Methods in Digital Pathology	Jun 24, 2025	Anomaly DetectionArtifact Detection	—Unverified
MDR-DeePC: Model-Inspired Distributionally Robust Data-Enabled Predictive Control	Jun 24, 2025	Benchmarking	—Unverified
QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges	Jun 24, 2025	BenchmarkingCode Generation	—Unverified
Staining normalization in histopathology: Method benchmarking using multicenter dataset	Jun 23, 2025	Benchmarking	—Unverified
Simulation-Based Sensitivity Analysis in Optimal Treatment Regimes and Causal Decomposition with Individualized Interventions	Jun 23, 2025	BenchmarkingSensitivity	—Unverified
Generalizing Vision-Language Models to Novel Domains: A Comprehensive Survey	Jun 23, 2025	BenchmarkingSurvey	—Unverified
Benchmarking Music Generation Models and Metrics via Human Preference Studies	Jun 23, 2025	BenchmarkingMusic Generation	—Unverified
Survey of HPC in US Research Institutions	Jun 23, 2025	BenchmarkingGPU	—Unverified
Benchmarking histopathology foundation models in a multi-center dataset for skin cancer subtyping	Jun 23, 2025	BenchmarkingDiversity	CodeCode Available
Statistical Multicriteria Evaluation of LLM-Generated Text	Jun 22, 2025	BenchmarkingDiversity	CodeCode Available
On the Robustness of Human-Object Interaction Detection against Distribution Shift	Jun 22, 2025	BenchmarkingData Augmentation	—Unverified

Show:10 25 50

← PrevPage 31 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified