| Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions | Aug 2, 2024 | Benchmarkingmultimodal interaction | CodeCode Available | 0 | 5 |
| AI-generated Image Quality Assessment in Visual Communication | Dec 20, 2024 | BenchmarkingImage Quality Assessment | CodeCode Available | 0 | 5 |
| Benchmarking Large Language Models for Molecule Prediction Tasks | Mar 8, 2024 | BenchmarkingPrediction | CodeCode Available | 0 | 5 |
| DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions | May 8, 2025 | Autonomous NavigationBenchmarking | CodeCode Available | 0 | 5 |
| Are Large Language Models Good at Utility Judgments? | Mar 28, 2024 | Answer GenerationBenchmarking | CodeCode Available | 0 | 5 |
| DispaRisk: Auditing Fairness Through Usable Information | May 20, 2024 | BenchmarkingBias Detection | CodeCode Available | 0 | 5 |
| A Framework for Evaluating PM2.5 Forecasts from the Perspective of Individual Decision Making | Sep 9, 2024 | BenchmarkingDecision Making | CodeCode Available | 0 | 5 |
| GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Feb 22, 2024 | Benchmarking | CodeCode Available | 0 | 5 |
| GenderBench: Evaluation Suite for Gender Biases in LLMs | May 17, 2025 | Benchmarking | CodeCode Available | 0 | 5 |
| Generalization and Regularization in DQN | Sep 29, 2018 | Atari GamesBenchmarking | CodeCode Available | 0 | 5 |