Benchmarking

Papers

Showing 76–100 of 5548 papers

Title	Date	Tasks	Status	Hype
GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies	Jun 17, 2025	Benchmarking	Code Available	1
ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge	Jun 17, 2025	Benchmarking, Retrieval	Code Available	0
Egocentric Human-Object Interaction Detection: A New Benchmark and Method	Jun 17, 2025	Benchmarking, Human-Object Interaction Detection	Unverified	0
The Price of Freedom: Exploring Expressivity and Runtime Tradeoffs in Equivariant Tensor Products	Jun 16, 2025	Benchmarking	Code Available	1
C-TLSAN: Content-Enhanced Time-Aware Long- and Short-Term Attention Network for Personalized Recommendation	Jun 16, 2025	Benchmarking, Recommendation Systems	Code Available	0
A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects	Jun 16, 2025	Benchmarking, Instance Segmentation	Unverified	0
Deep Diffusion Models and Unsupervised Hyperspectral Unmixing for Realistic Abundance Map Synthesis	Jun 16, 2025	Benchmarking, Data Augmentation	Unverified	0
Few-Shot Learning for Industrial Time Series: A Comparative Analysis Using the Example of Screw-Fastening Process Monitoring	Jun 16, 2025	Benchmarking, Few-Shot Learning	Unverified	0
Robustness of Reinforcement Learning-Based Traffic Signal Control under Incidents: A Comparative Study	Jun 16, 2025	Benchmarking, Traffic Signal Control	Unverified	0
JENGA: Object selection and pose estimation for robotic grasping from a stack	Jun 16, 2025	Benchmarking, Object	Unverified	0
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies	Jun 15, 2025	Benchmarking	Code Available	1
A large-scale, physically-based synthetic dataset for satellite pose estimation	Jun 15, 2025	Benchmarking, Dataset Generation	Unverified	0
MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios	Jun 15, 2025	Benchmarking	Code Available	0
OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics	Jun 14, 2025	Benchmarking	Code Available	4
Delving into Instance-Dependent Label Noise in Graph Data: A Comprehensive Study and Benchmark	Jun 14, 2025	Benchmarking, Graph Learning	Code Available	0
ANIRA: An Architecture for Neural Network Inference in Real-Time Audio Applications	Jun 14, 2025	Benchmarking	Code Available	3
Learning Best Paths in Quantum Networks	Jun 14, 2025	Benchmarking	Unverified	0
Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables	Jun 13, 2025	Benchmarking, Descriptive	Unverified	0
SemanticST: Spatially Informed Semantic Graph Learning for Clustering, Integration, and Scalable Analysis of Spatial Transcriptomics	Jun 13, 2025	Benchmarking, Contrastive Learning	Unverified	0
Temporal cross-validation impacts multivariate time series subsequence anomaly detection evaluation	Jun 13, 2025	Anomaly Detection, Benchmarking	Unverified	0
crossMoDA Challenge: Evolution of Cross-Modality Domain Adaptation Techniques for Vestibular Schwannoma and Cochlea Segmentation from 2021 to 2023	Jun 13, 2025	Benchmarking, Domain Adaptation	Unverified	0
EconGym: A Scalable AI Testbed with Diverse Economic Tasks	Jun 13, 2025	Benchmarking	Unverified	0
Mind the XAI Gap: A Human-Centered LLM Framework for Democratizing Explainable AI	Jun 13, 2025	Benchmarking, In-Context Learning	Code Available	0
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks	Jun 13, 2025	Benchmarking, Large Language Model	Code Available	2
HyBiomass: Global Hyperspectral Imagery Benchmark Dataset for Evaluating Geospatial Foundation Models in Forest Aboveground Biomass Estimation	Jun 12, 2025	Benchmarking	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified