| X-IQE: eXplainable Image Quality Evaluation for Text-to-Image Generation with Visual Large Language Models | May 18, 2023 | Benchmarking, Image Generation | Code Available | 1 |
| Human Behavioral Benchmarking: Numeric Magnitude Comparison Effects in Large Language Models | May 18, 2023 | Benchmarking | Unverified | 0 |
| Smiling Women Pitching Down: Auditing Representational and Presentational Gender Biases in Image Generative AI | May 17, 2023 | Benchmarking | Unverified | 0 |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | May 17, 2023 | Benchmarking, Diagnostic | Code Available | 1 |
| Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks | May 17, 2023 | Benchmarking | Unverified | 0 |
| Restoring Images Captured in Arbitrary Hybrid Adverse Weather Conditions in One Go | May 17, 2023 | Benchmarking, Image Restoration | Unverified | 0 |
| DLUE: Benchmarking Document Language Understanding | May 16, 2023 | Benchmarking, Document Classification | Unverified | 0 |
| An Empirical Study on Google Research Football Multi-agent Scenarios | May 16, 2023 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 1 |
| Benchmarking the human brain against computational architectures | May 15, 2023 | Benchmarking, Computational Efficiency | Unverified | 0 |
| OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking | May 15, 2023 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |