| VisTai: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan | Mar 13, 2025 | BenchmarkingDialogue Generation | CodeCode Available | 1 |
| CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection | Mar 12, 2025 | BenchmarkingCode Classification | CodeCode Available | 1 |
| Illuminating Darkness: Enhancing Real-world Low-light Scenes with Smartphone Images | Mar 10, 2025 | 4kBenchmarking | CodeCode Available | 1 |
| DependEval: Benchmarking LLMs for Repository Dependency Understanding | Mar 9, 2025 | BenchmarkingCode Generation | CodeCode Available | 1 |
| FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data | Mar 7, 2025 | BenchmarkingFederated Learning | CodeCode Available | 1 |
| UnPuzzle: A Unified Framework for Pathology Image Analysis | Mar 5, 2025 | BenchmarkingDiagnostic | CodeCode Available | 1 |
| One ruler to measure them all: Benchmarking multilingual long-context language models | Mar 3, 2025 | 8kAll | CodeCode Available | 1 |
| AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses | Mar 3, 2025 | Benchmarking | CodeCode Available | 1 |
| From Claims to Evidence: A Unified Framework and Critical Analysis of CNN vs. Transformer vs. Mamba in Medical Image Segmentation | Mar 3, 2025 | BenchmarkingComputational Efficiency | CodeCode Available | 1 |
| Delving into Out-of-Distribution Detection with Medical Vision-Language Models | Mar 2, 2025 | Benchmarkingimage-classification | CodeCode Available | 1 |
| Protein Structure Tokenization: Benchmarking and New Recipe | Feb 28, 2025 | BenchmarkingLanguage Modeling | CodeCode Available | 1 |
| LexRAG: Benchmarking Retrieval-Augmented Generation in Multi-Turn Legal Consultation Conversation | Feb 28, 2025 | ArticlesBenchmarking | CodeCode Available | 1 |
| EgoNormia: Benchmarking Physical Social Norm Understanding | Feb 27, 2025 | Answer GenerationBenchmarking | CodeCode Available | 1 |
| Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Feb 27, 2025 | Benchmarking | CodeCode Available | 1 |
| Generalizable deep learning for photoplethysmography-based blood pressure estimation -- A Benchmarking Study | Feb 26, 2025 | BenchmarkingBlood pressure estimation | CodeCode Available | 1 |
| CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Feb 26, 2025 | BenchmarkingCode Generation | CodeCode Available | 1 |
| Exploring Graph Tasks with Pure LLMs: A Comprehensive Benchmark and Investigation | Feb 26, 2025 | BenchmarkingGraph Learning | CodeCode Available | 1 |
| Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs | Feb 25, 2025 | BenchmarkingChunking | CodeCode Available | 1 |
| BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning | Feb 23, 2025 | Benchmarking | CodeCode Available | 1 |
| Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Feb 21, 2025 | Benchmarking | CodeCode Available | 1 |
| Benchmarking LLMs for Political Science: A United Nations Perspective | Feb 19, 2025 | BenchmarkingDecision Making | CodeCode Available | 1 |
| Reinforcement Learning for Dynamic Resource Allocation in Optical Networks: Hype or Hope? | Feb 18, 2025 | BenchmarkingBlocking | CodeCode Available | 1 |
| Positional Encoding in Transformer-Based Time Series Models: A Survey | Feb 17, 2025 | Anomaly DetectionBenchmarking | CodeCode Available | 1 |
| ILIAS: Instance-Level Image retrieval At Scale | Feb 17, 2025 | BenchmarkingImage Retrieval | CodeCode Available | 1 |
| HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims | Feb 17, 2025 | BenchmarkingFact Checking | CodeCode Available | 1 |