| TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators | Feb 20, 2025 | Benchmarking, Code Generation | Code Available | 2 |
| Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk | Feb 20, 2025 | Benchmarking | Unverified | 0 |
| Probabilistic Robustness in Deep Learning: A Concise yet Comprehensive Guide | Feb 20, 2025 | Adversarial Robustness, Benchmarking | Unverified | 0 |
| Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework | Feb 20, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models | Feb 20, 2025 | Benchmarking, Sentence | Unverified | 0 |
| Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks | Feb 20, 2025 | Benchmarking, Combinatorial Optimization | Unverified | 0 |
| Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models | Feb 20, 2025 | Benchmarking | Unverified | 0 |
| Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems | Feb 20, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Feb 20, 2025 | Age Estimation, Benchmarking | Code Available | 2 |
| PredictaBoard: Benchmarking LLM Score Predictability | Feb 20, 2025 | Benchmarking, Common Sense Reasoning | Code Available | 0 |