| Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Feb 27, 2025 | Benchmarking | CodeCode Available | 1 |
| MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors | Feb 26, 2025 | Benchmarking | —Unverified | 0 |
| Medical Hallucinations in Foundation Models and Their Impact on Healthcare | Feb 26, 2025 | BenchmarkingHallucination | CodeCode Available | 2 |
| Improved YOLOv12 with LLM-Generated Synthetic Data for Enhanced Apple Detection and Benchmarking Against YOLOv11 and YOLOv10 | Feb 26, 2025 | Benchmarkingobject-detection | —Unverified | 0 |
| Agentic Mixture-of-Workflows for Multi-Modal Chemical Search | Feb 26, 2025 | BenchmarkingRetrieval | —Unverified | 0 |
| Is Your Paper Being Reviewed by an LLM? A New Benchmark Dataset and Approach for Detecting AI Text in Peer Review | Feb 26, 2025 | BenchmarkingText Detection | —Unverified | 0 |
| Generalizable deep learning for photoplethysmography-based blood pressure estimation -- A Benchmarking Study | Feb 26, 2025 | BenchmarkingBlood pressure estimation | CodeCode Available | 1 |
| Modelling Regional Solar Photovoltaic Capacity in Great Britain | Feb 26, 2025 | Benchmarking | —Unverified | 0 |
| CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Feb 26, 2025 | BenchmarkingCode Generation | CodeCode Available | 1 |
| Exploring Graph Tasks with Pure LLMs: A Comprehensive Benchmark and Investigation | Feb 26, 2025 | BenchmarkingGraph Learning | CodeCode Available | 1 |
| MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering | Feb 26, 2025 | BenchmarkingQuestion Answering | —Unverified | 0 |
| BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction | Feb 26, 2025 | BenchmarkingTime Series | CodeCode Available | 3 |
| Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval | Feb 26, 2025 | BenchmarkingCode Generation | —Unverified | 0 |
| Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs | Feb 25, 2025 | BenchmarkingChunking | CodeCode Available | 1 |
| Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers | Feb 25, 2025 | ArticlesBenchmarking | —Unverified | 0 |
| CayleyPy RL: Pathfinding and Reinforcement Learning on Cayley Graphs | Feb 25, 2025 | Benchmarkingreinforcement-learning | —Unverified | 0 |
| OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation | Feb 25, 2025 | BenchmarkingSemantic Segmentation | —Unverified | 0 |
| Safe Multi-Agent Navigation guided by Goal-Conditioned Safe Reinforcement Learning | Feb 25, 2025 | BenchmarkingReinforcement Learning (RL) | CodeCode Available | 0 |
| A Real-time Spatio-Temporal Trajectory Planner for Autonomous Vehicles with Semantic Graph Optimization | Feb 25, 2025 | Autonomous VehiclesBenchmarking | —Unverified | 0 |
| Overconfident Oracles: Limitations of In Silico Sequence Design Benchmarking | Feb 24, 2025 | Benchmarking | —Unverified | 0 |
| Enhancing Image Matting in Real-World Scenes with Mask-Guided Iterative Refinement | Feb 24, 2025 | Benchmarkingfeature selection | —Unverified | 0 |
| SynthRAD2025 Grand Challenge dataset: generating synthetic CTs for radiotherapy | Feb 24, 2025 | BenchmarkingImage Generation | —Unverified | 0 |
| Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties | Feb 24, 2025 | Benchmarking | CodeCode Available | 0 |
| MULTITAT: Benchmarking Multilingual Table-and-Text Question Answering | Feb 24, 2025 | BenchmarkingQuestion Answering | CodeCode Available | 0 |
| Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts | Feb 24, 2025 | BenchmarkingFact Verification | CodeCode Available | 2 |
| On Neural Inertial Classification Networks for Pedestrian Activity Recognition | Feb 23, 2025 | Activity RecognitionBenchmarking | —Unverified | 0 |
| Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation | Feb 23, 2025 | Benchmarking | CodeCode Available | 4 |
| Benchmarking Online Object Trackers for Underwater Robot Position Locking Applications | Feb 23, 2025 | BenchmarkingObject Tracking | —Unverified | 0 |
| VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs | Feb 23, 2025 | Benchmarking | —Unverified | 0 |
| BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning | Feb 23, 2025 | Benchmarking | CodeCode Available | 1 |
| VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models | Feb 23, 2025 | BenchmarkingSpatial Reasoning | CodeCode Available | 0 |
| An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science | Feb 23, 2025 | BenchmarkingCode Generation | CodeCode Available | 0 |
| Unmasking Societal Biases in Respiratory Support for ICU Patients through Social Determinants of Health | Feb 23, 2025 | BenchmarkingFairness | CodeCode Available | 0 |
| Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries | Feb 23, 2025 | BenchmarkingImage Retrieval | CodeCode Available | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | Feb 21, 2025 | BenchmarkingDiagnostic | —Unverified | 0 |
| Methods and Trends in Detecting Generated Images: A Comprehensive Review | Feb 21, 2025 | BenchmarkingDeepFake Detection | —Unverified | 0 |
| Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation | Feb 21, 2025 | BenchmarkingLanguage Modeling | —Unverified | 0 |
| Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Feb 21, 2025 | Benchmarking | CodeCode Available | 1 |
| Benchmarking machine learning for bowel sound pattern classification from tabular features to pretrained models | Feb 21, 2025 | BenchmarkingDiagnostic | CodeCode Available | 0 |
| Para-Lane: Multi-Lane Dataset Registering Parallel Scans for Benchmarking Novel View Synthesis | Feb 21, 2025 | 3DGSAutonomous Driving | —Unverified | 0 |
| TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators | Feb 20, 2025 | BenchmarkingCode Generation | CodeCode Available | 2 |
| Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk | Feb 20, 2025 | Benchmarking | —Unverified | 0 |
| Probabilistic Robustness in Deep Learning: A Concise yet Comprehensive Guide | Feb 20, 2025 | Adversarial RobustnessBenchmarking | —Unverified | 0 |
| Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework | Feb 20, 2025 | BenchmarkingQuestion Answering | CodeCode Available | 0 |
| Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models | Feb 20, 2025 | BenchmarkingSentence | —Unverified | 0 |
| Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks | Feb 20, 2025 | BenchmarkingCombinatorial Optimization | —Unverified | 0 |
| Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models | Feb 20, 2025 | Benchmarking | —Unverified | 0 |
| Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems | Feb 20, 2025 | BenchmarkingDecision Making | —Unverified | 0 |
| FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Feb 20, 2025 | Age EstimationBenchmarking | CodeCode Available | 2 |
| PredictaBoard: Benchmarking LLM Score Predictability | Feb 20, 2025 | BenchmarkingCommon Sense Reasoning | CodeCode Available | 0 |