| An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science | Feb 23, 2025 | Benchmarking, Code Generation | Code Available | 0 |
| VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs | Feb 23, 2025 | Benchmarking | Unverified | 0 |
| Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries | Feb 23, 2025 | Benchmarking, Image Retrieval | Code Available | 0 |
| BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning | Feb 23, 2025 | Benchmarking | Code Available | 1 |
| Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation | Feb 21, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Methods and Trends in Detecting Generated Images: A Comprehensive Review | Feb 21, 2025 | Benchmarking, DeepFake Detection | Unverified | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | Feb 21, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Feb 21, 2025 | Benchmarking | Code Available | 1 |
| Para-Lane: Multi-Lane Dataset Registering Parallel Scans for Benchmarking Novel View Synthesis | Feb 21, 2025 | 3DGS, Autonomous Driving | Unverified | 0 |
| Benchmarking machine learning for bowel sound pattern classification from tabular features to pretrained models | Feb 21, 2025 | Benchmarking, Diagnostic | Code Available | 0 |