| MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations | Feb 10, 2025 | BenchmarkingIn-Context Learning | —Unverified | 0 |
| Accelerating Data Processing and Benchmarking of AI Models for Pathology | Feb 10, 2025 | Benchmarking | CodeCode Available | 4 |
| Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation | Feb 10, 2025 | Benchmarking | —Unverified | 0 |
| Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring | Feb 10, 2025 | Benchmarking | CodeCode Available | 0 |
| CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories | Feb 10, 2025 | Benchmarking | —Unverified | 0 |
| Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments | Feb 10, 2025 | BenchmarkingOptical Character Recognition | CodeCode Available | 1 |
| Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm) | Feb 9, 2025 | BenchmarkingCPU | —Unverified | 0 |
| Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models | Feb 9, 2025 | BenchmarkingCode Generation | —Unverified | 0 |
| Mol-MoE: Training Preference-Guided Routers for Molecule Generation | Feb 8, 2025 | BenchmarkingDrug Design | CodeCode Available | 0 |
| ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts | Feb 8, 2025 | BenchmarkingSelf-Supervised Learning | CodeCode Available | 1 |