| xai_evals: A Framework for Evaluating Post-Hoc Local Explanation Methods | Feb 5, 2025 | Benchmarking | Unverified | 0 |
| LadderMIL: Multiple Instance Learning with Coarse-to-Fine Self-Distillation | Feb 4, 2025 | Benchmarking, Classification | Unverified | 0 |
| Dynamic benchmarking framework for LLM-based conversational data capture | Feb 4, 2025 | Benchmarking | Unverified | 0 |
| Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation | Feb 4, 2025 | Benchmarking, Information Retrieval | Code Available | 4 |
| Evalita-LLM: Benchmarking Large Language Models on Italian | Feb 4, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models | Feb 4, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| A comparison of translation performance between DeepL and Supertext | Feb 4, 2025 | Benchmarking, Machine Translation | Code Available | 0 |
| No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets | Feb 4, 2025 | All, Benchmarking | Code Available | 0 |
| Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities | Feb 3, 2025 | Benchmarking, Large Language Model | Unverified | 0 |
| MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation | Feb 3, 2025 | Benchmarking, Fairness | Unverified | 0 |