| Transformers in Protein: A Survey | May 26, 2025 | BenchmarkingDrug Discovery | —Unverified | 0 |
| TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs | May 26, 2025 | BenchmarkingLarge Language Model | —Unverified | 0 |
| PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology | May 26, 2025 | BenchmarkingPrognosis | —Unverified | 0 |
| Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs | May 26, 2025 | BenchmarkingFault localization | CodeCode Available | 0 |
| Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement | May 26, 2025 | Benchmarking | CodeCode Available | 0 |
| FinLoRA: Benchmarking LoRA Methods for Fine-Tuning LLMs on Financial Datasets | May 26, 2025 | BenchmarkingGPU | —Unverified | 0 |
| AMQA: An Adversarial Dataset for Benchmarking Bias of LLMs in Medicine and Healthcare | May 26, 2025 | BenchmarkingMedical Diagnosis | CodeCode Available | 0 |
| Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages | May 26, 2025 | BenchmarkingTransliteration | —Unverified | 0 |
| Automated Text-to-Table for Reasoning-Intensive Table QA: Pipeline Design and Benchmarking Insights | May 26, 2025 | BenchmarkingQuestion Answering | CodeCode Available | 0 |
| EuroCon: Benchmarking Parliament Deliberation for Political Consensus Finding | May 26, 2025 | Benchmarking | —Unverified | 0 |
| Synthetic Time Series Forecasting with Transformer Architectures: Extensive Simulation Benchmarks | May 26, 2025 | BenchmarkingDecision Making Under Uncertainty | CodeCode Available | 0 |
| A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking | May 26, 2025 | BenchmarkingOptical Flow Estimation | —Unverified | 0 |
| StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs | May 26, 2025 | Benchmarking | —Unverified | 0 |
| AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems | May 26, 2025 | BenchmarkingRecommendation Systems | —Unverified | 0 |
| Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat | May 26, 2025 | BenchmarkingQuestion Answering | —Unverified | 0 |
| EnvSDD: Benchmarking Environmental Sound Deepfake Detection | May 25, 2025 | Audio Deepfake DetectionAudio Generation | —Unverified | 0 |
| DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research | May 25, 2025 | BenchmarkingInformation Retrieval | —Unverified | 0 |
| SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs | May 25, 2025 | BenchmarkingDiversity | —Unverified | 0 |
| Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding | May 25, 2025 | BenchmarkingMulti-Agent Path Finding | —Unverified | 0 |
| AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science | May 25, 2025 | BenchmarkingFeature Engineering | —Unverified | 0 |
| Retrieval-Augmented Generation for Service Discovery: Chunking Strategies and Benchmarking | May 25, 2025 | BenchmarkingChunking | —Unverified | 0 |
| Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments | May 25, 2025 | Benchmarking | —Unverified | 0 |
| SPDEBench: An Extensive Benchmark for Learning Regular and Singular Stochastic PDEs | May 24, 2025 | Benchmarking | CodeCode Available | 0 |
| Benchmarking and Rethinking Knowledge Editing for Large Language Models | May 24, 2025 | Benchmarkingknowledge editing | CodeCode Available | 0 |
| Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs | May 24, 2025 | Benchmarking | —Unverified | 0 |
| So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection | May 24, 2025 | BenchmarkingImage Forgery Detection | —Unverified | 0 |
| Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset | May 24, 2025 | BenchmarkingRAG | CodeCode Available | 0 |
| Benchmarking Poisoning Attacks against Retrieval-Augmented Generation | May 24, 2025 | BenchmarkingQuestion Answering | —Unverified | 0 |
| From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation | May 24, 2025 | ArticlesBenchmarking | —Unverified | 0 |
| LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges | May 24, 2025 | BenchmarkingMathematical Reasoning | CodeCode Available | 0 |
| SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models | May 24, 2025 | BenchmarkingVideo Grounding | —Unverified | 0 |
| A Position Paper on the Automatic Generation of Machine Learning Leaderboards | May 23, 2025 | BenchmarkingPosition | CodeCode Available | 0 |
| SEvoBench : A C++ Framework For Evolutionary Single-Objective Optimization Benchmarking | May 23, 2025 | BenchmarkingComputational Efficiency | —Unverified | 0 |
| Wildfire spread forecasting with Deep Learning | May 23, 2025 | BenchmarkingDeep Learning | CodeCode Available | 0 |
| PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language | May 23, 2025 | BenchmarkingQuestion Answering | —Unverified | 0 |
| Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts | May 23, 2025 | Benchmarking | —Unverified | 0 |
| SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification | May 23, 2025 | BenchmarkingClassification | CodeCode Available | 0 |
| Benchmark for Antibody Binding Affinity Maturation and Design | May 23, 2025 | Benchmarking | —Unverified | 0 |
| MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation | May 23, 2025 | Audio GenerationBenchmarking | —Unverified | 0 |
| U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding | May 23, 2025 | BenchmarkingSpatial Reasoning | —Unverified | 0 |
| 3D Face Reconstruction Error Decomposed: A Modular Benchmark for Fair and Fast Method Evaluation | May 23, 2025 | 3D Face ReconstructionBenchmarking | CodeCode Available | 0 |
| Is Single-View Mesh Reconstruction Ready for Robotics? | May 23, 2025 | 3D ReconstructionBenchmarking | —Unverified | 0 |
| JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | May 23, 2025 | BenchmarkingDiversity | CodeCode Available | 0 |
| PawPrint: Whose Footprints Are These? Identifying Animal Individuals by Their Footprints | May 23, 2025 | Benchmarking | —Unverified | 0 |
| Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 | May 22, 2025 | BenchmarkingDialogue Generation | —Unverified | 0 |
| Experimental robustness benchmark of quantum neural network on a superconducting quantum processor | May 22, 2025 | Adversarial AttackAdversarial Robustness | —Unverified | 0 |
| Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models | May 22, 2025 | BenchmarkingLanguage Modeling | —Unverified | 0 |
| Edge-First Language Model Inference: Models, Metrics, and Tradeoffs | May 22, 2025 | BenchmarkingLanguage Modeling | —Unverified | 0 |
| KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models | May 22, 2025 | BenchmarkingDiagnostic | —Unverified | 0 |
| Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing | May 22, 2025 | Benchmarking | —Unverified | 0 |