| ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind | Jan 15, 2025 | BenchmarkingMultiple-choice | CodeCode Available | 1 |
| Towards Multilingual LLM Evaluation for Baltic and Nordic languages: A study on Lithuanian History | Jan 15, 2025 | Multiple-choiceQuestion Answering | —Unverified | 0 |
| Rethinking AI Cultural Alignment | Jan 13, 2025 | Multiple-choice | —Unverified | 0 |
| Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation | Jan 12, 2025 | AttributeMultiple-choice | —Unverified | 0 |
| ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian | Jan 12, 2025 | BenchmarkingMath | CodeCode Available | 1 |
| First Token Probability Guided RAG for Telecom Question Answering | Jan 11, 2025 | Multiple-choiceMultiple Choice Question Answering (MCQA) | —Unverified | 0 |
| Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding | Jan 10, 2025 | Automatic Speech RecognitionClassification | CodeCode Available | 0 |
| Affordably Fine-tuned LLMs Provide Better Answers to Course-specific MCQs | Jan 10, 2025 | Multiple-choice | CodeCode Available | 0 |
| Knowledge Retrieval Based on Generative AI | Jan 8, 2025 | Large Language ModelMultiple-choice | —Unverified | 0 |
| DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests | Jan 8, 2025 | Multimodal ReasoningMultiple-choice | —Unverified | 0 |