| VisTai: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan | Mar 13, 2025 | Benchmarking, Dialogue Generation | Code Available | 1 |
| CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection | Mar 12, 2025 | Benchmarking, Code Classification | Code Available | 1 |
| Illuminating Darkness: Enhancing Real-world Low-light Scenes with Smartphone Images | Mar 10, 2025 | 4k, Benchmarking | Code Available | 1 |
| DependEval: Benchmarking LLMs for Repository Dependency Understanding | Mar 9, 2025 | Benchmarking, Code Generation | Code Available | 1 |
| FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data | Mar 7, 2025 | Benchmarking, Federated Learning | Code Available | 1 |
| UnPuzzle: A Unified Framework for Pathology Image Analysis | Mar 5, 2025 | Benchmarking, Diagnostic | Code Available | 1 |
| AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses | Mar 3, 2025 | Benchmarking | Code Available | 1 |
| One ruler to measure them all: Benchmarking multilingual long-context language models | Mar 3, 2025 | 8k, All | Code Available | 1 |
| From Claims to Evidence: A Unified Framework and Critical Analysis of CNN vs. Transformer vs. Mamba in Medical Image Segmentation | Mar 3, 2025 | Benchmarking, Computational Efficiency | Code Available | 1 |
| Delving into Out-of-Distribution Detection with Medical Vision-Language Models | Mar 2, 2025 | Benchmarking, image-classification | Code Available | 1 |
| Protein Structure Tokenization: Benchmarking and New Recipe | Feb 28, 2025 | Benchmarking, Language Modeling | Code Available | 1 |
| LexRAG: Benchmarking Retrieval-Augmented Generation in Multi-Turn Legal Consultation Conversation | Feb 28, 2025 | Articles, Benchmarking | Code Available | 1 |
| Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Feb 27, 2025 | Benchmarking | Code Available | 1 |
| EgoNormia: Benchmarking Physical Social Norm Understanding | Feb 27, 2025 | Answer Generation, Benchmarking | Code Available | 1 |
| Generalizable deep learning for photoplethysmography-based blood pressure estimation -- A Benchmarking Study | Feb 26, 2025 | Benchmarking, Blood pressure estimation | Code Available | 1 |
| CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Feb 26, 2025 | Benchmarking, Code Generation | Code Available | 1 |
| Exploring Graph Tasks with Pure LLMs: A Comprehensive Benchmark and Investigation | Feb 26, 2025 | Benchmarking, Graph Learning | Code Available | 1 |
| Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs | Feb 25, 2025 | Benchmarking, Chunking | Code Available | 1 |
| BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning | Feb 23, 2025 | Benchmarking | Code Available | 1 |
| Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Feb 21, 2025 | Benchmarking | Code Available | 1 |
| Benchmarking LLMs for Political Science: A United Nations Perspective | Feb 19, 2025 | Benchmarking, Decision Making | Code Available | 1 |
| Reinforcement Learning for Dynamic Resource Allocation in Optical Networks: Hype or Hope? | Feb 18, 2025 | Benchmarking, Blocking | Code Available | 1 |
| ILIAS: Instance-Level Image retrieval At Scale | Feb 17, 2025 | Benchmarking, Image Retrieval | Code Available | 1 |
| Positional Encoding in Transformer-Based Time Series Models: A Survey | Feb 17, 2025 | Anomaly Detection, Benchmarking | Code Available | 1 |
| HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims | Feb 17, 2025 | Benchmarking, Fact Checking | Code Available | 1 |
| LOB-Bench: Benchmarking Generative AI for Finance -- an Application to Limit Order Book Data | Feb 13, 2025 | Benchmarking, State Space Models | Code Available | 1 |
| Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs | Feb 13, 2025 | Benchmarking, Retrieval | Code Available | 1 |
| Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments | Feb 10, 2025 | Benchmarking, Optical Character Recognition | Code Available | 1 |
| Foundation Model of Electronic Medical Records for Adaptive Risk Estimation | Feb 10, 2025 | Benchmarking | Code Available | 1 |
| ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts | Feb 8, 2025 | Benchmarking, Self-Supervised Learning | Code Available | 1 |
| An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks | Feb 7, 2025 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 1 |
| Large Language Models for Multi-Robot Systems: A Survey | Feb 6, 2025 | Action Generation, Benchmarking | Code Available | 1 |
| PICBench: Benchmarking LLMs for Photonic Integrated Circuits Design | Feb 5, 2025 | Benchmarking, Prompt Engineering | Code Available | 1 |
| MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models | Feb 2, 2025 | Benchmarking | Code Available | 1 |
| HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns | Jan 28, 2025 | Adversarial Attack, Benchmarking | Code Available | 1 |
| Enhancing Biomedical Relation Extraction with Directionality | Jan 23, 2025 | Benchmarking, Document-level Relation Extraction | Code Available | 1 |
| InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models | Jan 19, 2025 | Benchmarking, Question Answering | Code Available | 1 |
| ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind | Jan 15, 2025 | Benchmarking, Multiple-choice | Code Available | 1 |
| Multimodal LLMs Can Reason about Aesthetics in Zero-Shot | Jan 15, 2025 | Benchmarking, Hallucination | Code Available | 1 |
| TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations | Jan 13, 2025 | Benchmarking, Domain Adaptation | Code Available | 1 |
| ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian | Jan 12, 2025 | Benchmarking, Math | Code Available | 1 |
| Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis | Jan 11, 2025 | Attribute, Benchmarking | Code Available | 1 |
| DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports and Patient-Specific Information | Jan 10, 2025 | Benchmarking, Data Augmentation | Code Available | 1 |
| VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models | Jan 9, 2025 | Benchmarking, Mathematical Problem-Solving | Code Available | 1 |
| Underwater Image Restoration Through a Prior Guided Hybrid Sense Approach and Extensive Benchmark Analysis | Jan 6, 2025 | Benchmarking, Image Enhancement | Code Available | 1 |
| CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Jan 2, 2025 | Benchmarking, Computer Security | Code Available | 1 |
| TrajLearn: Trajectory Prediction Learning using Deep Generative Models | Dec 30, 2024 | Autonomous Navigation, Benchmarking | Code Available | 1 |
| SMAC-Hard: Enabling Mixed Opponent Strategy Script and Self-play on SMAC | Dec 23, 2024 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 1 |
| On the Generalization Ability of Machine-Generated Text Detectors | Dec 23, 2024 | Benchmarking, Misinformation | Code Available | 1 |
| Generative CKM Construction using Partially Observed Data with Diffusion Model | Dec 19, 2024 | Benchmarking | Code Available | 1 |