| TIIF-Bench: How Does Your T2I Model Follow Your Instructions? | Jun 2, 2025 | BenchmarkingInstruction Following | —Unverified | 0 |
| Benchmarking Neural Speech Codec Intelligibility with SITool | Jun 2, 2025 | BenchmarkingDiagnostic | —Unverified | 0 |
| CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models | Jun 2, 2025 | Benchmarking | CodeCode Available | 0 |
| ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness | Jun 1, 2025 | BenchmarkingManagement | CodeCode Available | 0 |
| MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book | Jun 1, 2025 | Benchmarking | CodeCode Available | 0 |
| ModuLM: Enabling Modular and Multimodal Molecular Relational Learning with Large Language Models | Jun 1, 2025 | BenchmarkingRelational Reasoning | —Unverified | 0 |
| The iNaturalist Sounds Dataset | May 31, 2025 | Benchmarking | —Unverified | 0 |
| Benchmarking Foundation Models for Zero-Shot Biometric Tasks | May 30, 2025 | AttributeBenchmarking | —Unverified | 0 |
| CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation | May 30, 2025 | BenchmarkingMachine Translation | —Unverified | 0 |
| Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents | May 30, 2025 | BenchmarkingCode Repair | —Unverified | 0 |
| PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models | May 30, 2025 | Benchmarking | —Unverified | 0 |
| Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework | May 30, 2025 | Benchmarking | CodeCode Available | 0 |
| Automated Structured Radiology Report Generation | May 30, 2025 | Benchmarking | —Unverified | 0 |
| Geospatial Foundation Models to Enable Progress on Sustainable Development Goals | May 30, 2025 | BenchmarkingEarth Observation | —Unverified | 0 |
| Progressive Class-level Distillation | May 30, 2025 | BenchmarkingKnowledge Distillation | —Unverified | 0 |
| MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs | May 30, 2025 | Benchmarking | CodeCode Available | 0 |
| Segmenting France Across Four Centuries | May 30, 2025 | BenchmarkingImage-to-Image Translation | CodeCode Available | 0 |
| SORCE: Small Object Retrieval in Complex Environments | May 30, 2025 | BenchmarkingImage Retrieval | CodeCode Available | 0 |
| GenSpace: Benchmarking Spatially-Aware Image Generation | May 30, 2025 | BenchmarkingImage Generation | —Unverified | 0 |
| PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset | May 30, 2025 | BenchmarkingMultiple Instance Learning | CodeCode Available | 0 |
| Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization | May 30, 2025 | BenchmarkingCryptanalysis | —Unverified | 0 |
| R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation | May 29, 2025 | BenchmarkingImage Generation | —Unverified | 0 |
| Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | May 29, 2025 | BenchmarkingFairness | CodeCode Available | 0 |
| LLM Performance for Code Generation on Noisy Tasks | May 29, 2025 | BenchmarkingCode Generation | CodeCode Available | 0 |
| MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge | May 29, 2025 | Benchmarking | —Unverified | 0 |
| SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | May 29, 2025 | BenchmarkingInformation Retrieval | CodeCode Available | 0 |
| Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns | May 29, 2025 | Benchmarking | —Unverified | 0 |
| Joint Phase Shift Optimization and Precoder Selection for RIS-Assisted 5G NR MIMO Systems | May 29, 2025 | Benchmarking | —Unverified | 0 |
| Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking | May 29, 2025 | BenchmarkingGraph Question Answering | —Unverified | 0 |
| PGLearn -- An Open-Source Learning Toolkit for Optimal Power Flow | May 28, 2025 | Benchmarking | —Unverified | 0 |
| Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese | May 28, 2025 | Benchmarking | CodeCode Available | 0 |
| Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate | May 28, 2025 | Benchmarking | —Unverified | 0 |
| HelixDesign-Binder: A Scalable Production-Grade Platform for Binder Design Built on HelixFold3 | May 28, 2025 | BenchmarkingEfficient Exploration | —Unverified | 0 |
| Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates | May 28, 2025 | BenchmarkingDiversity | —Unverified | 0 |
| B-XAIC Dataset: Benchmarking Explainable AI for Graph Neural Networks Using Chemical Data | May 28, 2025 | BenchmarkingDrug Discovery | CodeCode Available | 0 |
| Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective | May 28, 2025 | BenchmarkingMemorization | CodeCode Available | 0 |
| TabularQGAN: A Quantum Generative Model for Tabular Data | May 28, 2025 | BenchmarkingGenerative Adversarial Network | —Unverified | 0 |
| Jailbreak Distillation: Renewable Safety Benchmarking | May 28, 2025 | BenchmarkingDiversity | —Unverified | 0 |
| StarBASE-GP: Biologically-Guided Automated Machine Learning for Genotype-to-Phenotype Association Analysis | May 28, 2025 | Benchmarking | CodeCode Available | 0 |
| MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators | May 28, 2025 | BenchmarkingChatbot | CodeCode Available | 0 |
| Yambda-5B -- A Large-Scale Multi-modal Dataset for Ranking And Retrieval | May 28, 2025 | BenchmarkingRecommendation Systems | —Unverified | 0 |
| Fedivertex: a Graph Dataset based on Decentralized Social Networks for Trustworthy Machine Learning | May 27, 2025 | Benchmarking | CodeCode Available | 0 |
| Laparoscopic Image Desmoking Using the U-Net with New Loss Function and Integrated Differentiable Wiener Filter | May 27, 2025 | Benchmarking | CodeCode Available | 0 |
| VideoMarkBench: Benchmarking Robustness of Video Watermarking | May 27, 2025 | Benchmarking | CodeCode Available | 0 |
| SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | May 27, 2025 | BenchmarkingMultiple-choice | —Unverified | 0 |
| Gauss-Ramanujan Functions: Constructions, Properties, and Applications in Communications and Signal Processing | May 27, 2025 | Benchmarking | —Unverified | 0 |
| MoE-Gyro: Self-Supervised Over-Range Reconstruction and Denoising for MEMS Gyroscopes | May 27, 2025 | BenchmarkingDenoising | —Unverified | 0 |
| AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs | May 27, 2025 | BenchmarkingQuestion Selection | CodeCode Available | 0 |
| DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding | May 27, 2025 | BenchmarkingChange Detection | —Unverified | 0 |
| FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering | May 27, 2025 | BenchmarkingQuestion Answering | CodeCode Available | 0 |