| Is Single-View Mesh Reconstruction Ready for Robotics? | May 23, 2025 | 3D ReconstructionBenchmarking | —Unverified | 0 |
| Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions | May 23, 2025 | 2kBenchmarking | CodeCode Available | 1 |
| JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | May 23, 2025 | BenchmarkingDiversity | CodeCode Available | 0 |
| Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts | May 23, 2025 | Benchmarking | —Unverified | 0 |
| SEvoBench : A C++ Framework For Evolutionary Single-Objective Optimization Benchmarking | May 23, 2025 | BenchmarkingComputational Efficiency | —Unverified | 0 |
| Semantic Correspondence: Unified Benchmarking and a Strong Baseline | May 23, 2025 | BenchmarkingSemantic correspondence | CodeCode Available | 1 |
| Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph | May 23, 2025 | BenchmarkingManagement | CodeCode Available | 1 |
| Wildfire spread forecasting with Deep Learning | May 23, 2025 | BenchmarkingDeep Learning | CodeCode Available | 0 |
| DailyQA: A Benchmark to Evaluate Web Retrieval Augmented LLMs Based on Capturing Real-World Changes | May 22, 2025 | BenchmarkingRAG | —Unverified | 0 |
| Learning collective multi-cellular dynamics from temporal scRNA-seq via a transformer-enhanced Neural SDE | May 22, 2025 | BenchmarkingTime Series | CodeCode Available | 0 |
| Tropical Attention: Neural Algorithmic Reasoning for Combinatorial Algorithms | May 22, 2025 | Adversarial AttackBenchmarking | —Unverified | 0 |
| Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 | May 22, 2025 | BenchmarkingDialogue Generation | —Unverified | 0 |
| BAGELS: Benchmarking the Automated Generation and Extraction of Limitations from Scholarly Text | May 22, 2025 | BenchmarkingRAG | —Unverified | 0 |
| KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models | May 22, 2025 | BenchmarkingDiagnostic | —Unverified | 0 |
| CUB: Benchmarking Context Utilisation Techniques for Language Models | May 22, 2025 | BenchmarkingFact Checking | —Unverified | 0 |
| IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models | May 22, 2025 | BenchmarkingInstruction Following | CodeCode Available | 3 |
| Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models | May 22, 2025 | BenchmarkingLanguage Modeling | —Unverified | 0 |
| MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks | May 22, 2025 | BenchmarkingSpatial Reasoning | —Unverified | 0 |
| Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing | May 22, 2025 | Benchmarking | —Unverified | 0 |
| Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | May 22, 2025 | BenchmarkingEvidence Selection | CodeCode Available | 1 |
| Can AI Read Between The Lines? Benchmarking LLMs On Financial Nuance | May 22, 2025 | BenchmarkingPrompt Engineering | —Unverified | 0 |
| BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research | May 22, 2025 | Benchmarking | —Unverified | 0 |
| EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios | May 22, 2025 | Benchmarking | CodeCode Available | 1 |
| REOBench: Benchmarking Robustness of Earth Observation Foundation Models | May 22, 2025 | BenchmarkingContrastive Learning | CodeCode Available | 1 |
| MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries | May 22, 2025 | BenchmarkingInformation Retrieval | —Unverified | 0 |
| AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | May 22, 2025 | BenchmarkingInstruction Following | CodeCode Available | 1 |
| Zero-Shot Hyperspectral Pansharpening Using Hysteresis-Based Tuning for Spectral Quality Control | May 22, 2025 | BenchmarkingPansharpening | CodeCode Available | 0 |
| When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | May 22, 2025 | Benchmarking | —Unverified | 0 |
| Experimental robustness benchmark of quantum neural network on a superconducting quantum processor | May 22, 2025 | Adversarial AttackAdversarial Robustness | —Unverified | 0 |
| Edge-First Language Model Inference: Models, Metrics, and Tradeoffs | May 22, 2025 | BenchmarkingLanguage Modeling | —Unverified | 0 |
| AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models | May 22, 2025 | BenchmarkingFairness | CodeCode Available | 3 |
| SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation | May 21, 2025 | BenchmarkingCode Generation | —Unverified | 0 |
| NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction | May 21, 2025 | BenchmarkingHallucination | —Unverified | 0 |
| Benchmarking Chest X-ray Diagnosis Models Across Multinational Datasets | May 21, 2025 | BenchmarkingDiagnostic | —Unverified | 0 |
| InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation | May 21, 2025 | BenchmarkingRAG | —Unverified | 0 |
| VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models | May 21, 2025 | Benchmarking | CodeCode Available | 0 |
| Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | May 21, 2025 | BenchmarkingLanguage Modeling | CodeCode Available | 0 |
| AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals | May 21, 2025 | BenchmarkingChatbot | —Unverified | 0 |
| UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking | May 21, 2025 | BenchmarkingClaim Verification | CodeCode Available | 0 |
| UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning | May 21, 2025 | BenchmarkingImitation Learning | —Unverified | 0 |
| Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | May 21, 2025 | BenchmarkingMath | —Unverified | 0 |
| Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs | May 21, 2025 | BenchmarkingQuestion Answering | CodeCode Available | 0 |
| Benchmarking Energy and Latency in TinyML: A Novel Method for Resource-Constrained AI | May 21, 2025 | Benchmarking | —Unverified | 0 |
| Towards Zero-Shot Differential Morphing Attack Detection with Multimodal Large Language Models | May 21, 2025 | BenchmarkingPrompt Engineering | —Unverified | 0 |
| VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models | May 21, 2025 | BenchmarkingReinforcement Learning (RL) | —Unverified | 0 |
| Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory | May 21, 2025 | BenchmarkingLanguage Modeling | CodeCode Available | 0 |
| Oral Imaging for Malocclusion Issues Assessments: OMNI Dataset, Deep Learning Baselines and Benchmarking | May 21, 2025 | BenchmarkingDiagnostic | CodeCode Available | 0 |
| A Risk Taxonomy for Evaluating AI-Powered Psychotherapy Agents | May 21, 2025 | BenchmarkingDecompensation | —Unverified | 0 |
| Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks | May 21, 2025 | BenchmarkingGPU | —Unverified | 0 |
| DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis | May 20, 2025 | BenchmarkingFairness | —Unverified | 0 |