| Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models | Apr 14, 2025 | Benchmarking, Descriptive | Unverified | 0 |
| Trade-offs in Privacy-Preserving Eye Tracking through Iris Obfuscation: A Benchmarking Study | Apr 14, 2025 | Benchmarking, Gaze Estimation | Code Available | 0 |
| Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design | Apr 14, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding | Apr 12, 2025 | Benchmarking, Document AI | Unverified | 0 |
| SortBench: Benchmarking LLMs based on their ability to sort lists | Apr 11, 2025 | Benchmarking | Unverified | 0 |
| TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning | Apr 11, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark | Apr 10, 2025 | Benchmarking | Code Available | 0 |
| Geological Inference from Textual Data using Word Embeddings | Apr 10, 2025 | Benchmarking, Word Embeddings | Code Available | 0 |
| Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge | Apr 10, 2025 | Adversarial Robustness, Benchmarking | Code Available | 0 |
| Adaptive Shrinkage Estimation For Personalized Deep Kernel Regression In Modeling Brain Trajectories | Apr 10, 2025 | Additive models, Benchmarking | Code Available | 0 |