| SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents | Jun 9, 2025 | Benchmarking, Synthetic Data Generation | Unverified | 0 |
| The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning | Jun 9, 2025 | Active Learning, Benchmarking | Code Available | 0 |
| REMoH: A Reflective Evolution of Multi-objective Heuristics approach via Large Language Models | Jun 9, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis | Jun 9, 2025 | Action Classification, Benchmarking | Unverified | 0 |
| GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra | Jun 9, 2025 | 3D Reconstruction, Benchmarking | Unverified | 0 |
| CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems | Jun 9, 2025 | Attribute, Benchmarking | Code Available | 0 |
| Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding | Jun 9, 2025 | Benchmarking, Video Compression | Unverified | 0 |
| HuSc3D: Human Sculpture dataset for 3D object reconstruction | Jun 9, 2025 | 3D Object Reconstruction, 3D Reconstruction | Code Available | 0 |
| How Far Are We from Optimal Reasoning Efficiency? | Jun 8, 2025 | 16k, Benchmarking | Code Available | 0 |
| LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping | Jun 7, 2025 | Benchmarking, Simultaneous Localization and Mapping | Code Available | 0 |
| MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks | Jun 6, 2025 | Benchmarking | Code Available | 0 |
| BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures | Jun 6, 2025 | Benchmarking, CPU | Unverified | 0 |
| DeepFake Doctor: Diagnosing and Treating Audio-Video Fake Detection | Jun 6, 2025 | Benchmarking, DeepFake Detection | Unverified | 0 |
| Numerical Investigation of Sequence Modeling Theory using Controllable Memory Functions | Jun 6, 2025 | Benchmarking, State Space Models | Unverified | 0 |
| Benchmarking Misuse Mitigation Against Covert Adversaries | Jun 6, 2025 | Benchmarking, Language Modeling | Code Available | 0 |
| Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques | Jun 6, 2025 | Benchmarking, Model Selection | Unverified | 0 |
| EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition | Jun 5, 2025 | Benchmarking, Emotion Recognition | Unverified | 0 |
| Design of intelligent proofreading system for English translation based on CNN and BERT | Jun 5, 2025 | Benchmarking, Machine Translation | Unverified | 0 |
| HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model | Jun 5, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems | Jun 5, 2025 | Benchmarking, RAG | Unverified | 0 |
| BSBench: will your LLM find the largest prime number? | Jun 5, 2025 | Benchmarking | Code Available | 0 |
| AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | Jun 5, 2025 | Benchmarking, Video Understanding | Unverified | 0 |
| CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx | Jun 5, 2025 | 2D Pose Estimation, Benchmarking | Unverified | 0 |
| DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models | Jun 5, 2025 | Benchmarking, Diversity | Unverified | 0 |
| A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values | Jun 5, 2025 | Benchmarking | Unverified | 0 |