Benchmarking

Papers

Showing 1551–1600 of 5548 papers

Title	Date	Tasks	Status
Statistical Multicriteria Evaluation of LLM-Generated Text	Jun 22, 2025	Benchmarking, Diversity	Code Available
Leveling the Playing Field: Carefully Comparing Classical and Learned Controllers for Quadrotor Trajectory Tracking	Jun 21, 2025	Benchmarking, Reinforcement Learning (RL)	Unverified
A Comparative Analysis of Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) as Dimensionality Reduction Techniques	Jun 20, 2025	Benchmarking, Dimensionality Reduction	Unverified
Universal Music Representations? Evaluating Foundation Models on World Music Corpora	Jun 20, 2025	Benchmarking, Few-Shot Learning	Code Available
Spotting tell-tale visual artifacts in face swapping videos: strengths and pitfalls of CNN detectors	Jun 19, 2025	Benchmarking, Face Swapping	Unverified
OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents	Jun 19, 2025	Benchmarking	Unverified
Finance Language Model Evaluation (FLaME)	Jun 18, 2025	Benchmarking, Language Model Evaluation	Unverified
PGLib-CO2: A Power Grid Library for Computing and Optimizing Carbon Emissions	Jun 17, 2025	Benchmarking	Unverified
Q2SAR: A Quantum Multiple Kernel Learning Approach for Drug Discovery	Jun 17, 2025	Benchmarking, Drug Discovery	Unverified
ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge	Jun 17, 2025	Benchmarking, Retrieval	Code Available
A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning	Jun 17, 2025	Benchmarking, Self-Supervised Learning	Unverified
Egocentric Human-Object Interaction Detection: A New Benchmark and Method	Jun 17, 2025	Benchmarking, Human-Object Interaction Detection	Unverified
Few-Shot Learning for Industrial Time Series: A Comparative Analysis Using the Example of Screw-Fastening Process Monitoring	Jun 16, 2025	Benchmarking, Few-Shot Learning	Unverified
JENGA: Object selection and pose estimation for robotic grasping from a stack	Jun 16, 2025	Benchmarking, Object	Unverified
A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects	Jun 16, 2025	Benchmarking, Instance Segmentation	Unverified
Deep Diffusion Models and Unsupervised Hyperspectral Unmixing for Realistic Abundance Map Synthesis	Jun 16, 2025	Benchmarking, Data Augmentation	Unverified
C-TLSAN: Content-Enhanced Time-Aware Long- and Short-Term Attention Network for Personalized Recommendation	Jun 16, 2025	Benchmarking, Recommendation Systems	Code Available
Robustness of Reinforcement Learning-Based Traffic Signal Control under Incidents: A Comparative Study	Jun 16, 2025	Benchmarking, Traffic Signal Control	Unverified
MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios	Jun 15, 2025	Benchmarking	Code Available
A large-scale, physically-based synthetic dataset for satellite pose estimation	Jun 15, 2025	Benchmarking, Dataset Generation	Unverified
Learning Best Paths in Quantum Networks	Jun 14, 2025	Benchmarking	Unverified
Delving into Instance-Dependent Label Noise in Graph Data: A Comprehensive Study and Benchmark	Jun 14, 2025	Benchmarking, Graph Learning	Code Available
EconGym: A Scalable AI Testbed with Diverse Economic Tasks	Jun 13, 2025	Benchmarking	Unverified
Temporal cross-validation impacts multivariate time series subsequence anomaly detection evaluation	Jun 13, 2025	Anomaly Detection, Benchmarking	Unverified
SemanticST: Spatially Informed Semantic Graph Learning for Clustering, Integration, and Scalable Analysis of Spatial Transcriptomics	Jun 13, 2025	Benchmarking, Contrastive Learning	Unverified
Mind the XAI Gap: A Human-Centered LLM Framework for Democratizing Explainable AI	Jun 13, 2025	Benchmarking, In-Context Learning	Code Available
crossMoDA Challenge: Evolution of Cross-Modality Domain Adaptation Techniques for Vestibular Schwannoma and Cochlea Segmentation from 2021 to 2023	Jun 13, 2025	Benchmarking, Domain Adaptation	Unverified
Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables	Jun 13, 2025	Benchmarking, Descriptive	Unverified
OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics	Jun 12, 2025	Benchmarking	Unverified
HyBiomass: Global Hyperspectral Imagery Benchmark Dataset for Evaluating Geospatial Foundation Models in Forest Aboveground Biomass Estimation	Jun 12, 2025	Benchmarking	Unverified
Primender Sequence: A Novel Mathematical Construct for Testing Symbolic Inference and AI Reasoning	Jun 12, 2025	Benchmarking	Unverified
Sum Rate Maximization for Pinching Antennas Assisted RSMA System With Multiple Waveguides	Jun 12, 2025	Benchmarking	Unverified
FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models	Jun 11, 2025	Benchmarking, Federated Learning	Unverified
HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios	Jun 11, 2025	Action Recognition, Action Segmentation	Code Available
ScholarSearch: Benchmarking Scholar Searching Ability of LLMs	Jun 11, 2025	Benchmarking, Information Retrieval	Unverified
ICE-ID: A Novel Historical Census Data Benchmark Comparing NARS against LLMs, & a ML Ensemble on Longitudinal Identity Resolution	Jun 11, 2025	Benchmarking	Unverified
Bench to the Future: A Pastcasting Benchmark for Forecasting Agents	Jun 11, 2025	Benchmarking	Unverified
Reasoning as a Resource: Optimizing Fast and Slow Thinking in Code Generation Models	Jun 11, 2025	Benchmarking, Code Generation	Unverified
GRAIL: A Benchmark for GRaph ActIve Learning in Dynamic Sensing Environments	Jun 11, 2025	Active Learning, Benchmarking	Unverified
A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild	Jun 11, 2025	Age Estimation, Benchmarking	Code Available
Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens	Jun 10, 2025	Benchmarking, Mathematical Reasoning	Unverified
Graph Attention-based Decentralized Actor-Critic for Dual-Objective Control of Multi-UAV Swarms	Jun 10, 2025	Benchmarking, Graph Attention	Unverified
AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP	Jun 10, 2025	Benchmarking, Sentiment Analysis	Unverified
Solving excited states for long-range interacting trapped ions with neural networks	Jun 10, 2025	Benchmarking	Unverified
Benchmarking Foundation Speech and Language Models for Alzheimer's Disease and Related Dementia Detection from Spontaneous Speech	Jun 9, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Ensuring Reliability of Curated EHR-Derived Data: The Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) Framework	Jun 9, 2025	Benchmarking, Fairness	Unverified
GradEscape: A Gradient-Based Evader Against AI-Generated Text Detectors	Jun 9, 2025	Benchmarking, Model extraction	Unverified
Benchmarking Pre-Trained Time Series Models for Electricity Price Forecasting	Jun 9, 2025	Benchmarking, Decision Making	Unverified
The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning	Jun 9, 2025	Active Learning, Benchmarking	Code Available
Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding	Jun 9, 2025	Benchmarking, Video Compression	Unverified

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified