| VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | Jun 16, 2024 | Action UnderstandingBenchmarking | —Unverified | 0 |
| VCEval: Rethinking What is a Good Educational Video and How to Automatically Evaluate It | Jun 15, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam | Jun 14, 2024 | FairnessLogical Reasoning | CodeCode Available | 0 |
| DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Jun 13, 2024 | BenchmarkingHallucination | CodeCode Available | 0 |
| Bayesian Statistical Modeling with Predictors from LLMs | Jun 13, 2024 | Multiple-choice | —Unverified | 0 |
| AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models | Jun 13, 2024 | Multiple-choice | —Unverified | 0 |
| OLMES: A Standard for Language Model Evaluations | Jun 12, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| BertaQA: How Much Do Language Models Know About Local Culture? | Jun 11, 2024 | Multiple-choiceTransfer Learning | CodeCode Available | 0 |
| Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context | Jun 10, 2024 | Decision MakingMultiple-choice | —Unverified | 0 |
| Towards a Personal Health Large Language Model | Jun 10, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |