Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 101–125 of 5548 papers

Title	Date	Tasks	Status	Hype
Sum Rate Maximization for Pinching Antennas Assisted RSMA System With Multiple Waveguides	Jun 12, 2025	Benchmarking	—Unverified	0
OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics	Jun 12, 2025	Benchmarking	—Unverified	0
Primender Sequence: A Novel Mathematical Construct for Testing Symbolic Inference and AI Reasoning	Jun 12, 2025	Benchmarking	—Unverified	0
SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis	Jun 12, 2025	BenchmarkingDialogue Generation	CodeCode Available	2
Bench to the Future: A Pastcasting Benchmark for Forecasting Agents	Jun 11, 2025	Benchmarking	—Unverified	0
ICE-ID: A Novel Historical Census Data Benchmark Comparing NARS against LLMs, \& a ML Ensemble on Longitudinal Identity Resolution	Jun 11, 2025	Benchmarking	—Unverified	0
ScholarSearch: Benchmarking Scholar Searching Ability of LLMs	Jun 11, 2025	BenchmarkingInformation Retrieval	—Unverified	0
Reasoning as a Resource: Optimizing Fast and Slow Thinking in Code Generation Models	Jun 11, 2025	BenchmarkingCode Generation	—Unverified	0
FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models	Jun 11, 2025	BenchmarkingFederated Learning	—Unverified	0
Attention, Please! Revisiting Attentive Probing for Masked Image Modeling	Jun 11, 2025	BenchmarkingComputational Efficiency	CodeCode Available	1
GLGENN: A Novel Parameter-Light Equivariant Neural Networks Architecture Based on Clifford Geometric Algebras	Jun 11, 2025	Benchmarking	CodeCode Available	1
HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios	Jun 11, 2025	Action RecognitionAction Segmentation	CodeCode Available	0
IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments	Jun 11, 2025	Benchmarking	CodeCode Available	2
A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild	Jun 11, 2025	Age EstimationBenchmarking	CodeCode Available	0
GRAIL: A Benchmark for GRaph ActIve Learning in Dynamic Sensing Environments	Jun 11, 2025	Active LearningBenchmarking	—Unverified	0
Graph Attention-based Decentralized Actor-Critic for Dual-Objective Control of Multi-UAV Swarms	Jun 10, 2025	BenchmarkingGraph Attention	—Unverified	0
scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data	Jun 10, 2025	BenchmarkingData Augmentation	CodeCode Available	1
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling	Jun 10, 2025	Benchmarking	CodeCode Available	1
AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP	Jun 10, 2025	BenchmarkingSentiment Analysis	—Unverified	0
Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens	Jun 10, 2025	BenchmarkingMathematical Reasoning	—Unverified	0
Solving excited states for long-range interacting trapped ions with neural networks	Jun 10, 2025	Benchmarking	—Unverified	0
Benchmarking Foundation Speech and Language Models for Alzheimer's Disease and Related Dementia Detection from Spontaneous Speech	Jun 9, 2025	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified	0
The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning	Jun 9, 2025	Active LearningBenchmarking	CodeCode Available	0
HuSc3D: Human Sculpture dataset for 3D object reconstruction	Jun 9, 2025	3D Object Reconstruction3D Reconstruction	CodeCode Available	0
REMoH: A Reflective Evolution of Multi-objective Heuristics approach via Large Language Models	Jun 9, 2025	BenchmarkingDecision Making	—Unverified	0

Show:10 25 50

← PrevPage 5 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified