| Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models | Feb 4, 2025 | Benchmarking, Decision Making | —Unverified | 0 |
| Dynamic benchmarking framework for LLM-based conversational data capture | Feb 4, 2025 | Benchmarking | —Unverified | 0 |
| MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation | Feb 3, 2025 | Benchmarking, Fairness | —Unverified | 0 |
| SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering | Feb 3, 2025 | Benchmarking, Code Generation | —Unverified | 0 |
| EdgeMark: An Automation and Benchmarking System for Embedded Artificial Intelligence Tools | Feb 3, 2025 | Benchmarking | —Unverified | 0 |
| Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities | Feb 3, 2025 | Benchmarking, Large Language Model | —Unverified | 0 |
| Learned Bayesian Cramér-Rao Bound for Unknown Measurement Models Using Score Neural Networks | Feb 2, 2025 | Benchmarking | Code Available | 0 |
| True Online TD-Replan(lambda) Achieving Planning through Replaying | Jan 31, 2025 | Benchmarking | —Unverified | 0 |
| MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | Jan 30, 2025 | Benchmarking, Decision Making | —Unverified | 0 |
| Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency | Jan 30, 2025 | Benchmarking, Language Modeling | —Unverified | 0 |