| FM-Planner: Foundation Model Guided Path Planning for Autonomous Drone Navigation | May 27, 2025 | Benchmarking, Decision Making | Code Available | 1 |
| Laparoscopic Image Desmoking Using the U-Net with New Loss Function and Integrated Differentiable Wiener Filter | May 27, 2025 | Benchmarking | Code Available | 0 |
| AMQA: An Adversarial Dataset for Benchmarking Bias of LLMs in Medicine and Healthcare | May 26, 2025 | Benchmarking, Medical Diagnosis | Code Available | 0 |
| Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models | May 26, 2025 | Benchmarking, RAG | Code Available | 1 |
| AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems | May 26, 2025 | Benchmarking, Recommendation Systems | Unverified | 0 |
| MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | May 26, 2025 | Benchmarking, Minecraft | Code Available | 1 |
| Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages | May 26, 2025 | Benchmarking, Transliteration | Unverified | 0 |
| OB3D: A New Dataset for Benchmarking Omnidirectional 3D Reconstruction Using Blender | May 26, 2025 | 3DGS, 3D Reconstruction | Code Available | 1 |
| Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat | May 26, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking | May 26, 2025 | Benchmarking, Optical Flow Estimation | Unverified | 0 |
| Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement | May 26, 2025 | Benchmarking | Code Available | 0 |
| PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology | May 26, 2025 | Benchmarking, Prognosis | Unverified | 0 |
| TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs | May 26, 2025 | Benchmarking, Large Language Model | Unverified | 0 |
| Automated Text-to-Table for Reasoning-Intensive Table QA: Pipeline Design and Benchmarking Insights | May 26, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| FinLoRA: Benchmarking LoRA Methods for Fine-Tuning LLMs on Financial Datasets | May 26, 2025 | Benchmarking, GPU | Code Available | 0 |
| Transformers in Protein: A Survey | May 26, 2025 | Benchmarking, Drug Discovery | Unverified | 0 |
| Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs | May 26, 2025 | Benchmarking, Fault localization | Code Available | 0 |
| Synthetic Time Series Forecasting with Transformer Architectures: Extensive Simulation Benchmarks | May 26, 2025 | Benchmarking, Decision Making Under Uncertainty | Code Available | 0 |
| EuroCon: Benchmarking Parliament Deliberation for Political Consensus Finding | May 26, 2025 | Benchmarking | Unverified | 0 |
| StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs | May 26, 2025 | Benchmarking | Unverified | 0 |
| AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science | May 25, 2025 | Benchmarking, Feature Engineering | Unverified | 0 |
| SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning | May 25, 2025 | Benchmarking, Visual Reasoning | Code Available | 1 |
| DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research | May 25, 2025 | Benchmarking, Information Retrieval | Unverified | 0 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | May 25, 2025 | Anatomy, Benchmarking | Code Available | 1 |
| SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs | May 25, 2025 | Benchmarking, Diversity | Unverified | 0 |
| Benchmarking Laparoscopic Surgical Image Restoration and Beyond | May 25, 2025 | Benchmarking, Image Restoration | Code Available | 2 |
| Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding | May 25, 2025 | Benchmarking, Multi-Agent Path Finding | Unverified | 0 |
| EnvSDD: Benchmarking Environmental Sound Deepfake Detection | May 25, 2025 | Audio Deepfake Detection, Audio Generation | Unverified | 0 |
| Retrieval-Augmented Generation for Service Discovery: Chunking Strategies and Benchmarking | May 25, 2025 | Benchmarking, Chunking | Unverified | 0 |
| Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments | May 25, 2025 | Benchmarking | Unverified | 0 |
| Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset | May 24, 2025 | Benchmarking, RAG | Code Available | 0 |
| Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs | May 24, 2025 | Benchmarking | Unverified | 0 |
| From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation | May 24, 2025 | Articles, Benchmarking | Unverified | 0 |
| CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | May 24, 2025 | Benchmarking | Code Available | 2 |
| LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges | May 24, 2025 | Benchmarking, Mathematical Reasoning | Code Available | 0 |
| Benchmarking and Rethinking Knowledge Editing for Large Language Models | May 24, 2025 | Benchmarking, knowledge editing | Code Available | 0 |
| SPDEBench: An Extensive Benchmark for Learning Regular and Singular Stochastic PDEs | May 24, 2025 | Benchmarking | Code Available | 0 |
| SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models | May 24, 2025 | Benchmarking, Video Grounding | Unverified | 0 |
| ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation | May 24, 2025 | Benchmarking, Chart Understanding | Code Available | 3 |
| Benchmarking Poisoning Attacks against Retrieval-Augmented Generation | May 24, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection | May 24, 2025 | Benchmarking, Image Forgery Detection | Unverified | 0 |
| MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation | May 23, 2025 | Audio Generation, Benchmarking | Unverified | 0 |
| Benchmark for Antibody Binding Affinity Maturation and Design | May 23, 2025 | Benchmarking | Unverified | 0 |
| U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding | May 23, 2025 | Benchmarking, Spatial Reasoning | Unverified | 0 |
| 3D Face Reconstruction Error Decomposed: A Modular Benchmark for Fair and Fast Method Evaluation | May 23, 2025 | 3D Face Reconstruction, Benchmarking | Code Available | 0 |
| A Position Paper on the Automatic Generation of Machine Learning Leaderboards | May 23, 2025 | Benchmarking, Position | Code Available | 0 |
| SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification | May 23, 2025 | Benchmarking, Classification | Code Available | 0 |
| PawPrint: Whose Footprints Are These? Identifying Animal Individuals by Their Footprints | May 23, 2025 | Benchmarking | Unverified | 0 |
| PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language | May 23, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow | May 23, 2025 | Benchmarking, Code Generation | Code Available | 1 |