| Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks | Feb 20, 2025 | Benchmarking, Combinatorial Optimization | Unverified | 0 |
| Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models | Feb 20, 2025 | Benchmarking | Unverified | 0 |
| Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk | Feb 20, 2025 | Benchmarking | Unverified | 0 |
| Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models | Feb 20, 2025 | Benchmarking, Sentence | Unverified | 0 |
| Reinforcement Learning with Graph Attention for Routing and Wavelength Assignment with Lightpath Reuse | Feb 20, 2025 | Benchmarking, Graph Attention | Unverified | 0 |
| Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems | Feb 20, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework | Feb 20, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| PredictaBoard: Benchmarking LLM Score Predictability | Feb 20, 2025 | Benchmarking, Common Sense Reasoning | Code Available | 0 |
| VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare | Feb 19, 2025 | Benchmarking, Diversity | Unverified | 0 |
| Position: There are no Champions in Long-Term Time Series Forecasting | Feb 19, 2025 | Benchmarking, Position | Unverified | 0 |