| Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval | Feb 26, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction | Feb 26, 2025 | Benchmarking, Time Series | Code Available | 3 |
| MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering | Feb 26, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs | Feb 25, 2025 | Benchmarking, Chunking | Code Available | 1 |
| Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers | Feb 25, 2025 | Articles, Benchmarking | Unverified | 0 |
| CayleyPy RL: Pathfinding and Reinforcement Learning on Cayley Graphs | Feb 25, 2025 | Benchmarking, reinforcement-learning | Unverified | 0 |
| Safe Multi-Agent Navigation guided by Goal-Conditioned Safe Reinforcement Learning | Feb 25, 2025 | Benchmarking, Reinforcement Learning (RL) | Code Available | 0 |
| OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation | Feb 25, 2025 | Benchmarking, Semantic Segmentation | Unverified | 0 |
| A Real-time Spatio-Temporal Trajectory Planner for Autonomous Vehicles with Semantic Graph Optimization | Feb 25, 2025 | Autonomous Vehicles, Benchmarking | Unverified | 0 |
| Overconfident Oracles: Limitations of In Silico Sequence Design Benchmarking | Feb 24, 2025 | Benchmarking | Unverified | 0 |