| First Token Probability Guided RAG for Telecom Question Answering | Jan 11, 2025 | Multiple-choiceMultiple Choice Question Answering (MCQA) | —Unverified | 0 |
| From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | May 17, 2024 | BenchmarkingMultiple-choice | —Unverified | 0 |
| Cleared for Takeoff? Compositional & Conditional Reasoning may be the Achilles Heel to (Flight-Booking) Language Agents | Apr 5, 2024 | Multiple-choiceNavigate | —Unverified | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | May 14, 2024 | FormHuman-Object Interaction Detection | —Unverified | 0 |
| ARGUS: Hallucination and Omission Evaluation in Video-LLMs | Jun 9, 2025 | DescriptiveForm | —Unverified | 0 |
| Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data | Mar 13, 2025 | Large Language ModelMath | —Unverified | 0 |
| Are You Doubtful? Oh, It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation | Dec 16, 2024 | Multiple-choice | —Unverified | 0 |
| AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning | May 16, 2024 | Multiple-choiceQuestion Answering | —Unverified | 0 |
| A review of faithfulness metrics for hallucination assessment in Large Language Models | Dec 31, 2024 | BenchmarkingHallucination | —Unverified | 0 |
| Adaptive Wizard for Removing Cross-Tier Misconfigurations in Active Directory | May 2, 2025 | Multiple-choice | —Unverified | 0 |
| Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks | Nov 9, 2023 | Multiple-choiceWorld Knowledge | —Unverified | 0 |
| Changing Answer Order Can Decrease MMLU Accuracy | Jun 27, 2024 | MMLUMultiple-choice | —Unverified | 0 |
| Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework | Nov 16, 2021 | Multiple-choiceQuestion Answering | —Unverified | 0 |
| Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation | May 15, 2025 | InformativenessMultiple-choice | —Unverified | 0 |
| CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding | Dec 16, 2024 | HallucinationMultiple-choice | —Unverified | 0 |
| Adaptive Crowdsourcing Algorithms for the Bandit Survey Problem | Feb 13, 2013 | Information RetrievalMultiple-choice | —Unverified | 0 |
| CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models | Jul 2, 2024 | Multiple-choice | —Unverified | 0 |
| Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models | Mar 15, 2024 | Few-Shot Image Classificationimage-classification | —Unverified | 0 |
| CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy | Oct 17, 2024 | Multiple-choiceResponse Generation | —Unverified | 0 |
| A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options | Dec 14, 2024 | Multiple-choice | —Unverified | 0 |
| AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models | Jun 13, 2024 | Multiple-choice | —Unverified | 0 |
| AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic | Mar 14, 2024 | EthicsMultiple-choice | —Unverified | 0 |
| Recent Advances in Multi-Choice Machine Reading Comprehension: A Survey on Methods and Datasets | Aug 4, 2024 | Few-Shot LearningMachine Reading Comprehension | —Unverified | 0 |
| Field-testing items using artificial intelligence: Natural language processing with transformers | Oct 18, 2023 | Multiple-choice | —Unverified | 0 |
| Fine-tuning BERT with Focus Words for Explanation Regeneration | Dec 1, 2020 | Explanation GenerationMultiple-choice | —Unverified | 0 |
| Can We Trust LLMs? Mitigate Overconfidence Bias in LLMs through Knowledge Transfer | May 27, 2024 | Multiple-choiceSentiment Analysis | —Unverified | 0 |
| AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects | Dec 31, 2024 | BenchmarkingMultiple-choice | —Unverified | 0 |
| FAMULUS: Interactive Annotation and Feedback Generation for Teaching Diagnostic Reasoning | Aug 29, 2019 | DiagnosticMultiple-choice | —Unverified | 0 |
| Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! | Jan 18, 2025 | Multiple-choiceQuestion Answering | —Unverified | 0 |
| GeoSQA: A Benchmark for Scenario-based Question Answering in the Geography Domain at High School Level | Aug 20, 2019 | General KnowledgeMultiple-choice | —Unverified | 0 |
| FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models | Apr 20, 2025 | DescriptiveEthics | —Unverified | 0 |
| Aqulia-Med LLM: Pioneering Full-Process Open-Source Medical Language Models | Jun 18, 2024 | Multiple-choice | —Unverified | 0 |
| FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees | Nov 4, 2024 | Multiple-choiceQuestion Answering | —Unverified | 0 |
| Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher Education Programming Courses? | Mar 16, 2023 | Multiple-choice | —Unverified | 0 |
| Can Crowdsourcing be used for Effective Annotation of Arabic? | May 1, 2014 | Entity ResolutionMultiple-choice | —Unverified | 0 |
| Can ChatGPT pass the Vietnamese National High School Graduation Examination? | Jun 15, 2023 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments | Nov 28, 2024 | Multiple-choice | —Unverified | 0 |
| Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension | Jan 16, 2020 | Machine Reading ComprehensionMultiple-choice | —Unverified | 0 |
| Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | Apr 4, 2025 | BenchmarkingManagement | —Unverified | 0 |
| A Joint-Reasoning based Disease Q&A System | Jan 6, 2024 | Knowledge GraphsMisinformation | —Unverified | 0 |
| How Additional Knowledge can Improve Natural Language Commonsense Question Answering? | Sep 19, 2019 | ArticlesLanguage Modeling | —Unverified | 0 |
| Analysis of the Cambridge Multiple-Choice Questions Reading Dataset with a Focus on Candidate Response Distribution | Jun 22, 2023 | Multiple-choice | —Unverified | 0 |
| Answer Uncertainty and Unanswerability in Multiple-Choice Machine Reading Comprehension | May 1, 2022 | Machine Reading ComprehensionMultiple-choice | —Unverified | 0 |
| Answer Uncertainty and Unanswerability in Multiple-Choice Machine Reading Comprehension | Jan 16, 2022 | Machine Reading ComprehensionMultiple-choice | —Unverified | 0 |
| Bridging the Language Gap: Knowledge Injected Multilingual Question Answering | Apr 6, 2023 | Cross-Lingual TransferExtractive Question-Answering | —Unverified | 0 |
| Bridging Information-Seeking Human Gaze and Machine Reading Comprehension | Sep 30, 2020 | Machine Reading ComprehensionMultiple-choice | —Unverified | 0 |
| Adapting Vision-Language Models for Evaluating World Models | Jun 22, 2025 | Action RecognitionMultimodal Reasoning | —Unverified | 0 |
| Exposing the Limits of Video-Text Models through Contrast Sets | Jan 16, 2022 | Language ModelingLanguage Modelling | —Unverified | 0 |
| FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | Mar 19, 2025 | BenchmarkingMultiple-choice | —Unverified | 0 |
| SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing Agents | Nov 12, 2024 | General KnowledgeHallucination | —Unverified | 0 |