| Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy | Mar 25, 2025 | Benchmarking, speech-recognition | Unverified | 0 |
| Writing as a testbed for open ended agents | Mar 25, 2025 | Benchmarking, Diversity | Unverified | 0 |
| The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs | Mar 25, 2025 | Benchmarking, Scene Segmentation | Code Available | 1 |
| Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery | Mar 24, 2025 | Benchmarking, Humanitarian | Code Available | 1 |
| Mining-Gym: A Configurable RL Benchmarking Environment for Truck Dispatch Scheduling | Mar 24, 2025 | Benchmarking, OpenAI Gym | Code Available | 0 |
| LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages | Mar 24, 2025 | Benchmarking | Code Available | 0 |
| Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages | Mar 24, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Benchmarking Post-Hoc Unknown-Category Detection in Food Recognition | Mar 24, 2025 | Benchmarking, Food Recognition | Unverified | 0 |
| Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness | Mar 24, 2025 | Benchmarking, Semantic Segmentation | Code Available | 1 |
| EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation | Mar 24, 2025 | Benchmarking, Data Augmentation | Unverified | 0 |