| Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models | May 26, 2025 | BenchmarkingRAG | CodeCode Available | 1 |
| SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning | May 25, 2025 | BenchmarkingVisual Reasoning | CodeCode Available | 1 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | May 25, 2025 | AnatomyBenchmarking | CodeCode Available | 1 |
| Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions | May 23, 2025 | 2kBenchmarking | CodeCode Available | 1 |
| Semantic Correspondence: Unified Benchmarking and a Strong Baseline | May 23, 2025 | BenchmarkingSemantic correspondence | CodeCode Available | 1 |
| Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph | May 23, 2025 | BenchmarkingManagement | CodeCode Available | 1 |
| FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow | May 23, 2025 | BenchmarkingCode Generation | CodeCode Available | 1 |
| Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | May 22, 2025 | BenchmarkingEvidence Selection | CodeCode Available | 1 |
| REOBench: Benchmarking Robustness of Earth Observation Foundation Models | May 22, 2025 | BenchmarkingContrastive Learning | CodeCode Available | 1 |
| AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | May 22, 2025 | BenchmarkingInstruction Following | CodeCode Available | 1 |