| Scenarios and Approaches for Situated Natural Language Explanations | Jun 7, 2024 | Benchmarking, In-Context Learning | Unverified | 0 |
| WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild | Jun 7, 2024 | Benchmarking, Chatbot | Code Available | 3 |
| Time Sensitive Knowledge Editing through Efficient Finetuning | Jun 6, 2024 | Benchmarking, Knowledge Editing | Unverified | 0 |
| Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking | Jun 6, 2024 | 6D Pose Estimation using RGB, Benchmarking | Unverified | 0 |
| NATURAL PLAN: Benchmarking LLMs on Natural Language Planning | Jun 6, 2024 | Benchmarking, Scheduling | Unverified | 0 |
| Benchmarking AlphaFold3's protein-protein complex accuracy and machine learning prediction reliability for binding free energy changes upon mutation | Jun 6, 2024 | Benchmarking, Drug Discovery | Unverified | 0 |
| Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As | Jun 6, 2024 | Articles, Benchmarking | Unverified | 0 |
| Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving | Jun 6, 2024 | Autonomous Driving, Bench2Drive | Code Available | 4 |
| Statistical Multicriteria Benchmarking via the GSD-Front | Jun 6, 2024 | Benchmarking | Unverified | 0 |
| Better Late Than Never: Formulating and Benchmarking Recommendation Editing | Jun 6, 2024 | Benchmarking, Recommendation Systems | Code Available | 0 |