| Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models | Feb 16, 2025 | Multiple-choice | Code Available | 1 |
| TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Feb 4, 2025 | Autonomous Driving, Multiple-choice | Code Available | 1 |
| FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Jan 17, 2025 | Fairness, Multiple-choice | Code Available | 1 |
| ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind | Jan 15, 2025 | Benchmarking, Multiple-choice | Code Available | 1 |
| ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian | Jan 12, 2025 | Benchmarking, Math | Code Available | 1 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Jan 6, 2025 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| Unifying Specialized Visual Encoders for Video Language Models | Jan 2, 2025 | Multiple-choice, Video Understanding | Code Available | 1 |
| Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion | Dec 12, 2024 | Hallucination, Knowledge Graph Completion | Code Available | 1 |
| AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? | Dec 3, 2024 | Multiple-choice | Code Available | 1 |
| SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages | Dec 2, 2024 | Multiple-choice | Code Available | 1 |