| LLM Evaluation Based on Aerospace Manufacturing Expertise: Automated Generation and Multi-Model Question Answering | Jan 25, 2025 | Information Retrieval, Multiple-choice | Unverified | 0 |
| Humanity's Last Exam | Jan 24, 2025 | Humanity's Last Exam, Language Modeling | Unverified | 0 |
| On the Reasoning Capacity of AI Models and How to Quantify It | Jan 23, 2025 | Memorization, MMLU | Unverified | 0 |
| Auto-Evaluation: A Critical Measure in Driving Improvements in Quality and Safety of AI-Generated Lesson Resources | Jan 23, 2025 | Multiple-choice | Unverified | 0 |
| Patent Figure Classification using Large Vision-language Models | Jan 22, 2025 | Classification, Few-Shot Learning | Code Available | 0 |
| The AI Penalization Effect: People Reduce Compensation for Workers Who Use AI | Jan 22, 2025 | Multiple-choice | Unverified | 0 |
| Generating Plausible Distractors for Multiple-Choice Questions via Student Choice Prediction | Jan 21, 2025 | Distractor Generation, Misconceptions | Unverified | 0 |
| Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! | Jan 18, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework | Jan 16, 2025 | Multiple-choice, Question Generation | Unverified | 0 |
| Vision-Language Models Do Not Understand Negation | Jan 16, 2025 | Multiple-choice, Negation | Unverified | 0 |
| Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong | Jan 16, 2025 | Multiple-choice | Unverified | 0 |
| Towards Multilingual LLM Evaluation for Baltic and Nordic languages: A study on Lithuanian History | Jan 15, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| Rethinking AI Cultural Alignment | Jan 13, 2025 | Multiple-choice | Unverified | 0 |
| Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation | Jan 12, 2025 | Attribute, Multiple-choice | Unverified | 0 |
| First Token Probability Guided RAG for Telecom Question Answering | Jan 11, 2025 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Unverified | 0 |
| Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding | Jan 10, 2025 | Automatic Speech Recognition, Classification | Code Available | 0 |
| Affordably Fine-tuned LLMs Provide Better Answers to Course-specific MCQs | Jan 10, 2025 | Multiple-choice | Code Available | 0 |
| Knowledge Retrieval Based on Generative AI | Jan 8, 2025 | Large Language Model, Multiple-choice | Unverified | 0 |
| DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests | Jan 8, 2025 | Multimodal Reasoning, Multiple-choice | Unverified | 0 |
| Localizing AI: Evaluating Open-Weight Language Models for Languages of Baltic States | Jan 7, 2025 | Machine Translation, Multiple-choice | Unverified | 0 |
| (WhyPHI) Fine-Tuning PHI-3 for Multiple-Choice Question Answering: Methodology, Results, and Challenges | Jan 3, 2025 | Multiple-choice, Question Answering | Code Available | 0 |
| CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering | Jan 2, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding | Jan 1, 2025 | Action Recognition, Multiple-choice | Code Available | 0 |
| Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation | Jan 1, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs | Jan 1, 2025 | Multiple-choice, Video Generation | Unverified | 0 |