| LOB-Bench: Benchmarking Generative AI for Finance -- an Application to Limit Order Book Data | Feb 13, 2025 | Benchmarking, State Space Models | Code Available | 1 |
| Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs | Feb 13, 2025 | Benchmarking, Retrieval | Code Available | 1 |
| Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments | Feb 10, 2025 | Benchmarking, Optical Character Recognition | Code Available | 1 |
| Foundation Model of Electronic Medical Records for Adaptive Risk Estimation | Feb 10, 2025 | Benchmarking | Code Available | 1 |
| ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts | Feb 8, 2025 | Benchmarking, Self-Supervised Learning | Code Available | 1 |
| An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks | Feb 7, 2025 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 1 |
| Large Language Models for Multi-Robot Systems: A Survey | Feb 6, 2025 | Action Generation, Benchmarking | Code Available | 1 |
| PICBench: Benchmarking LLMs for Photonic Integrated Circuits Design | Feb 5, 2025 | Benchmarking, Prompt Engineering | Code Available | 1 |
| MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models | Feb 2, 2025 | Benchmarking | Code Available | 1 |
| HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns | Jan 28, 2025 | Adversarial Attack, Benchmarking | Code Available | 1 |
| Enhancing Biomedical Relation Extraction with Directionality | Jan 23, 2025 | Benchmarking, Document-level Relation Extraction | Code Available | 1 |
| InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models | Jan 19, 2025 | Benchmarking, Question Answering | Code Available | 1 |
| ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind | Jan 15, 2025 | Benchmarking, Multiple-choice | Code Available | 1 |
| Multimodal LLMs Can Reason about Aesthetics in Zero-Shot | Jan 15, 2025 | Benchmarking, Hallucination | Code Available | 1 |
| TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations | Jan 13, 2025 | Benchmarking, Domain Adaptation | Code Available | 1 |
| ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian | Jan 12, 2025 | Benchmarking, Math | Code Available | 1 |
| Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis | Jan 11, 2025 | Attribute, Benchmarking | Code Available | 1 |
| DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports and Patient-Specific Information | Jan 10, 2025 | Benchmarking, Data Augmentation | Code Available | 1 |
| VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models | Jan 9, 2025 | Benchmarking, Mathematical Problem-Solving | Code Available | 1 |
| Underwater Image Restoration Through a Prior Guided Hybrid Sense Approach and Extensive Benchmark Analysis | Jan 6, 2025 | Benchmarking, Image Enhancement | Code Available | 1 |
| CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Jan 2, 2025 | Benchmarking, Computer Security | Code Available | 1 |
| TrajLearn: Trajectory Prediction Learning using Deep Generative Models | Dec 30, 2024 | Autonomous Navigation, Benchmarking | Code Available | 1 |
| SMAC-Hard: Enabling Mixed Opponent Strategy Script and Self-play on SMAC | Dec 23, 2024 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 1 |
| On the Generalization Ability of Machine-Generated Text Detectors | Dec 23, 2024 | Benchmarking, Misinformation | Code Available | 1 |
| Generative CKM Construction using Partially Observed Data with Diffusion Model | Dec 19, 2024 | Benchmarking | Code Available | 1 |