| An OpenMind for 3D medical vision self-supervised learning | Dec 22, 2024 | Benchmarking, Self-Supervised Learning | Code Available | 2 |
| First-frame Supervised Video Polyp Segmentation via Propagative and Semantic Dual-teacher Network | Dec 21, 2024 | Benchmarking, Transfer Learning | Code Available | 0 |
| HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios | Dec 21, 2024 | Benchmarking | Code Available | 0 |
| Patherea: Cell Detection and Classification for the 2020s | Dec 21, 2024 | Benchmarking, Cell Detection | Unverified | 0 |
| A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice | Dec 20, 2024 | Benchmarking, Diagnostic | Code Available | 0 |
| Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage | Dec 20, 2024 | Attribute, Benchmarking | Unverified | 0 |
| Enriching Social Science Research via Survey Item Linking | Dec 20, 2024 | Benchmarking, Entity Disambiguation | Code Available | 0 |
| Benchmarking LLMs and SLMs for patient reported outcomes | Dec 20, 2024 | Benchmarking, Privacy Preserving | Unverified | 0 |
| Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts | Dec 20, 2024 | Benchmarking, Optical Character Recognition | Code Available | 0 |
| AI-generated Image Quality Assessment in Visual Communication | Dec 20, 2024 | Benchmarking, Image Quality Assessment | Code Available | 0 |
| XRAG: eXamining the Core -- Benchmarking Foundational Components in Advanced Retrieval-Augmented Generation | Dec 20, 2024 | Benchmarking, Diagnostic | Code Available | 2 |
| TelcoLM: collecting data, adapting, and benchmarking language models for the telecommunication domain | Dec 20, 2024 | Benchmarking | Unverified | 0 |
| Generative CKM Construction using Partially Observed Data with Diffusion Model | Dec 19, 2024 | Benchmarking | Code Available | 1 |
| TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation | Dec 19, 2024 | Benchmarking, Description-guided molecule generation | Code Available | 1 |
| Pitfalls of topology-aware image segmentation | Dec 19, 2024 | Benchmarking, Image Segmentation | Unverified | 0 |
| AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | Dec 19, 2024 | Autonomous Driving, Benchmarking | Code Available | 2 |
| Autonomous Microscopy Experiments through Large Language Model Agents | Dec 18, 2024 | Benchmarking, Experimental Design | Code Available | 1 |
| Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning | Dec 18, 2024 | Benchmarking, Position | Unverified | 0 |
| AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge | Dec 18, 2024 | Benchmarking, World Knowledge | Code Available | 0 |
| Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning | Dec 18, 2024 | Benchmarking, Graph Learning | Code Available | 1 |
| TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks | Dec 18, 2024 | Benchmarking | Code Available | 1 |
| Open Universal Arabic ASR Leaderboard | Dec 18, 2024 | Benchmarking | Code Available | 2 |
| Generation of Large District Heating System Models Using Open-Source Data and Tools: An Exemplary Workflow | Dec 18, 2024 | Benchmarking | Unverified | 0 |
| RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment | Dec 18, 2024 | Benchmarking, RAG | Code Available | 1 |
| DateLogicQA: Benchmarking Temporal Biases in Large Language Models | Dec 17, 2024 | Benchmarking | Code Available | 0 |