| MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation | Feb 3, 2025 | Benchmarking, Fairness | Unverified | 0 |
| SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering | Feb 3, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models | Feb 2, 2025 | Benchmarking | Code Available | 1 |
| Learned Bayesian Cramér-Rao Bound for Unknown Measurement Models Using Score Neural Networks | Feb 2, 2025 | Benchmarking | Code Available | 0 |
| True Online TD-Replan(lambda) Achieving Planning through Replaying | Jan 31, 2025 | Benchmarking | Unverified | 0 |
| Evolving Hard Maximum Cut Instances for Quantum Approximate Optimization Algorithms | Jan 30, 2025 | Benchmarking, Combinatorial Optimization | Unverified | 0 |
| Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency | Jan 30, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| The iToBoS dataset: skin region images extracted from 3D total body photographs for lesion detection | Jan 30, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | Jan 30, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Unraveling the Capabilities of Language Models in News Summarization | Jan 30, 2025 | Benchmarking, Few-Shot Learning | Code Available | 0 |