| Sum Rate Maximization for Pinching Antennas Assisted RSMA System With Multiple Waveguides | Jun 12, 2025 | Benchmarking | Unverified | 0 |
| OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics | Jun 12, 2025 | Benchmarking | Unverified | 0 |
| Primender Sequence: A Novel Mathematical Construct for Testing Symbolic Inference and AI Reasoning | Jun 12, 2025 | Benchmarking | Unverified | 0 |
| SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis | Jun 12, 2025 | Benchmarking, Dialogue Generation | Code Available | 2 |
| Bench to the Future: A Pastcasting Benchmark for Forecasting Agents | Jun 11, 2025 | Benchmarking | Unverified | 0 |
| ICE-ID: A Novel Historical Census Data Benchmark Comparing NARS against LLMs, & a ML Ensemble on Longitudinal Identity Resolution | Jun 11, 2025 | Benchmarking | Unverified | 0 |
| ScholarSearch: Benchmarking Scholar Searching Ability of LLMs | Jun 11, 2025 | Benchmarking, Information Retrieval | Unverified | 0 |
| Reasoning as a Resource: Optimizing Fast and Slow Thinking in Code Generation Models | Jun 11, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Attention, Please! Revisiting Attentive Probing for Masked Image Modeling | Jun 11, 2025 | Benchmarking, Computational Efficiency | Code Available | 1 |
| HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios | Jun 11, 2025 | Action Recognition, Action Segmentation | Code Available | 0 |
| IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments | Jun 11, 2025 | Benchmarking | Code Available | 2 |
| FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models | Jun 11, 2025 | Benchmarking, Federated Learning | Unverified | 0 |
| A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild | Jun 11, 2025 | Age Estimation, Benchmarking | Code Available | 0 |
| GLGENN: A Novel Parameter-Light Equivariant Neural Networks Architecture Based on Clifford Geometric Algebras | Jun 11, 2025 | Benchmarking | Code Available | 1 |
| GRAIL: A Benchmark for GRaph ActIve Learning in Dynamic Sensing Environments | Jun 11, 2025 | Active Learning, Benchmarking | Unverified | 0 |
| Graph Attention-based Decentralized Actor-Critic for Dual-Objective Control of Multi-UAV Swarms | Jun 10, 2025 | Benchmarking, Graph Attention | Unverified | 0 |
| scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data | Jun 10, 2025 | Benchmarking, Data Augmentation | Code Available | 1 |
| Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens | Jun 10, 2025 | Benchmarking, Mathematical Reasoning | Unverified | 0 |
| CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Jun 10, 2025 | Benchmarking | Code Available | 1 |
| AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP | Jun 10, 2025 | Benchmarking, Sentiment Analysis | Unverified | 0 |
| Solving excited states for long-range interacting trapped ions with neural networks | Jun 10, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Foundation Speech and Language Models for Alzheimer's Disease and Related Dementia Detection from Spontaneous Speech | Jun 9, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning | Jun 9, 2025 | Active Learning, Benchmarking | Code Available | 0 |
| SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis | Jun 9, 2025 | Action Classification, Benchmarking | Unverified | 0 |
| Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding | Jun 9, 2025 | Benchmarking, Video Compression | Unverified | 0 |
| REMoH: A Reflective Evolution of Multi-objective Heuristics approach via Large Language Models | Jun 9, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| HuSc3D: Human Sculpture dataset for 3D object reconstruction | Jun 9, 2025 | 3D Object Reconstruction, 3D Reconstruction | Code Available | 0 |
| RADAR: Benchmarking Language Models on Imperfect Tabular Data | Jun 9, 2025 | Benchmarking, Missing Values | Code Available | 1 |
| CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems | Jun 9, 2025 | Attribute, Benchmarking | Code Available | 0 |
| Ensuring Reliability of Curated EHR-Derived Data: The Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) Framework | Jun 9, 2025 | Benchmarking, Fairness | Unverified | 0 |
| GradEscape: A Gradient-Based Evader Against AI-Generated Text Detectors | Jun 9, 2025 | Benchmarking, Model extraction | Unverified | 0 |
| Benchmarking Pre-Trained Time Series Models for Electricity Price Forecasting | Jun 9, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments | Jun 9, 2025 | Benchmarking, Navigate | Unverified | 0 |
| GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra | Jun 9, 2025 | 3D Reconstruction, Benchmarking | Unverified | 0 |
| Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim Evidence Reasoning | Jun 9, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents | Jun 9, 2025 | Benchmarking, Synthetic Data Generation | Unverified | 0 |
| How Far Are We from Optimal Reasoning Efficiency? | Jun 8, 2025 | 16k, Benchmarking | Code Available | 0 |
| LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping | Jun 7, 2025 | Benchmarking, Simultaneous Localization and Mapping | Code Available | 0 |
| BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures | Jun 6, 2025 | Benchmarking, CPU | Unverified | 0 |
| Benchmarking Misuse Mitigation Against Covert Adversaries | Jun 6, 2025 | Benchmarking, Language Modeling | Code Available | 0 |
| DeepFake Doctor: Diagnosing and Treating Audio-Video Fake Detection | Jun 6, 2025 | Benchmarking, DeepFake Detection | Unverified | 0 |
| Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques | Jun 6, 2025 | Benchmarking, Model Selection | Unverified | 0 |
| FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging | Jun 6, 2025 | Benchmarking | Code Available | 1 |
| Numerical Investigation of Sequence Modeling Theory using Controllable Memory Functions | Jun 6, 2025 | Benchmarking, State Space Models | Unverified | 0 |
| MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks | Jun 6, 2025 | Benchmarking | Code Available | 0 |
| MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark | Jun 5, 2025 | Benchmarking | Code Available | 1 |
| Benchmarking Large Language Models on Homework Assessment in Circuit Analysis | Jun 5, 2025 | Benchmarking | Unverified | 0 |
| EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition | Jun 5, 2025 | Benchmarking, Emotion Recognition | Unverified | 0 |
| Refer to Anything with Vision-Language Prompts | Jun 5, 2025 | Benchmarking, Generalized Referring Expression Segmentation | Unverified | 0 |
| DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models | Jun 5, 2025 | Benchmarking, Diversity | Unverified | 0 |