| PEDANTS: Cheap but Effective and Interpretable Answer Equivalence | Feb 17, 2024 | BenchmarkingForm | CodeCode Available | 2 |
| AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator | Feb 15, 2024 | BenchmarkingDiagnostic | CodeCode Available | 2 |
| MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models | Feb 14, 2024 | BenchmarkingDiversity | CodeCode Available | 2 |
| LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents | Feb 13, 2024 | BenchmarkingModel Selection | CodeCode Available | 2 |
| Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Feb 12, 2024 | BenchmarkingSimultaneous Localization and Mapping | CodeCode Available | 2 |
| AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Feb 12, 2024 | 2kAutomatic Speech Recognition | CodeCode Available | 2 |
| InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior | Feb 7, 2024 | BenchmarkingDecoder | CodeCode Available | 2 |
| LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K | Feb 6, 2024 | 16kBenchmarking | CodeCode Available | 2 |
| LtU-ILI: An All-in-One Framework for Implicit Inference in Astrophysics and Cosmology | Feb 6, 2024 | AllBenchmarking | CodeCode Available | 2 |
| EffiBench: Benchmarking the Efficiency of Automatically Generated Code | Feb 3, 2024 | BenchmarkingCode Completion | CodeCode Available | 2 |