| X-IQE: eXplainable Image Quality Evaluation for Text-to-Image Generation with Visual Large Language Models | May 18, 2023 | Benchmarking, Image Generation | Code Available | 1 |
| Human Behavioral Benchmarking: Numeric Magnitude Comparison Effects in Large Language Models | May 18, 2023 | Benchmarking | Unverified | 0 |
| Smiling Women Pitching Down: Auditing Representational and Presentational Gender Biases in Image Generative AI | May 17, 2023 | Benchmarking | Unverified | 0 |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | May 17, 2023 | Benchmarking, Diagnostic | Code Available | 1 |
| Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks | May 17, 2023 | Benchmarking | Unverified | 0 |
| Restoring Images Captured in Arbitrary Hybrid Adverse Weather Conditions in One Go | May 17, 2023 | Benchmarking, Image Restoration | Unverified | 0 |
| DLUE: Benchmarking Document Language Understanding | May 16, 2023 | Benchmarking, Document Classification | Unverified | 0 |
| An Empirical Study on Google Research Football Multi-agent Scenarios | May 16, 2023 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 1 |
| Benchmarking the human brain against computational architectures | May 15, 2023 | Benchmarking, Computational Efficiency | Unverified | 0 |
| OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking | May 15, 2023 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Predictive Models from Quantum Computer Benchmarks | May 15, 2023 | Benchmarking, image-classification | Unverified | 0 |
| A Strong Sustainability Paradigm Based Analytical Hierarchy Process (SSP-AHP) Method to Evaluate Sustainable Healthcare Systems | May 13, 2023 | Benchmarking | Unverified | 0 |
| MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine | May 12, 2023 | Benchmarking | Unverified | 0 |
| Benchmarking large language models for biomedical natural language processing applications and recommendations | May 10, 2023 | Benchmarking, Document Classification | Code Available | 1 |
| A Platform for the Biomedical Application of Large Language Models | May 10, 2023 | Benchmarking, Privacy Preserving | Code Available | 1 |
| Uncertainty in GNN Learning Evaluations: The Importance of a Consistent Benchmark for Community Detection | May 10, 2023 | Benchmarking, Community Detection | Unverified | 0 |
| InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation | May 10, 2023 | Benchmarking, Image Captioning | Code Available | 1 |
| DexArt: Benchmarking Generalizable Dexterous Manipulation with Articulated Objects | May 9, 2023 | Benchmarking, Decision Making | Code Available | 1 |
| Comparing Foundation Models using Data Kernels | May 9, 2023 | Benchmarking, Self-Supervised Learning | Unverified | 0 |
| A Comprehensive Study on Dataset Distillation: Performance, Privacy, Robustness and Fairness | May 5, 2023 | Benchmarking, Dataset Distillation | Unverified | 0 |
| Towards Segment Anything Model (SAM) for Medical Image Segmentation: A Survey | May 5, 2023 | Benchmarking, Image Generation | Code Available | 0 |
| Semantic Segmentation using Vision Transformers: A survey | May 5, 2023 | Autonomous Driving, Benchmarking | Unverified | 0 |
| Can LLMs Capture Human Preferences? | May 4, 2023 | Benchmarking | Unverified | 0 |
| Analyzing Hong Kong's Legal Judgments from a Computational Linguistics point-of-view | May 4, 2023 | Benchmarking, Graph Generation | Unverified | 0 |
| Working Memory Capacity of ChatGPT: An Empirical Study | Apr 30, 2023 | Benchmarking, Language Modeling | Code Available | 1 |