| Is Single-View Mesh Reconstruction Ready for Robotics? | May 23, 2025 | 3D Reconstruction, Benchmarking | Unverified | 0 |
| Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts | May 23, 2025 | Benchmarking | Unverified | 0 |
| JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | May 23, 2025 | Benchmarking, Diversity | Code Available | 0 |
| SEvoBench : A C++ Framework For Evolutionary Single-Objective Optimization Benchmarking | May 23, 2025 | Benchmarking, Computational Efficiency | Unverified | 0 |
| Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions | May 23, 2025 | 2k, Benchmarking | Code Available | 1 |
| Semantic Correspondence: Unified Benchmarking and a Strong Baseline | May 23, 2025 | Benchmarking, Semantic correspondence | Code Available | 1 |
| Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph | May 23, 2025 | Benchmarking, Management | Code Available | 1 |
| Wildfire spread forecasting with Deep Learning | May 23, 2025 | Benchmarking, Deep Learning | Code Available | 0 |
| DailyQA: A Benchmark to Evaluate Web Retrieval Augmented LLMs Based on Capturing Real-World Changes | May 22, 2025 | Benchmarking, RAG | Unverified | 0 |
| Learning collective multi-cellular dynamics from temporal scRNA-seq via a transformer-enhanced Neural SDE | May 22, 2025 | Benchmarking, Time Series | Code Available | 0 |
| Tropical Attention: Neural Algorithmic Reasoning for Combinatorial Algorithms | May 22, 2025 | Adversarial Attack, Benchmarking | Unverified | 0 |
| BAGELS: Benchmarking the Automated Generation and Extraction of Limitations from Scholarly Text | May 22, 2025 | Benchmarking, RAG | Unverified | 0 |
| Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 | May 22, 2025 | Benchmarking, Dialogue Generation | Unverified | 0 |
| BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research | May 22, 2025 | Benchmarking | Unverified | 0 |
| IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models | May 22, 2025 | Benchmarking, Instruction Following | Code Available | 3 |
| Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | May 22, 2025 | Benchmarking, Evidence Selection | Code Available | 1 |
| MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries | May 22, 2025 | Benchmarking, Information Retrieval | Unverified | 0 |
| Can AI Read Between The Lines? Benchmarking LLMs On Financial Nuance | May 22, 2025 | Benchmarking, Prompt Engineering | Unverified | 0 |
| REOBench: Benchmarking Robustness of Earth Observation Foundation Models | May 22, 2025 | Benchmarking, Contrastive Learning | Code Available | 1 |
| CUB: Benchmarking Context Utilisation Techniques for Language Models | May 22, 2025 | Benchmarking, Fact Checking | Unverified | 0 |
| Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing | May 22, 2025 | Benchmarking | Unverified | 0 |
| EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios | May 22, 2025 | Benchmarking | Code Available | 1 |
| AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | May 22, 2025 | Benchmarking, Instruction Following | Code Available | 1 |
| MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks | May 22, 2025 | Benchmarking, Spatial Reasoning | Unverified | 0 |
| Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models | May 22, 2025 | Benchmarking, Language Modeling | Unverified | 0 |