| End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering | Oct 10, 2016 | Language Modeling, Language Modelling | —Unverified | 0 |
| Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration | Jun 19, 2024 | Benchmarking, Distractor Generation | —Unverified | 0 |
| Enhancing Event Causality Identification with Rationale and Structure-Aware Causal Question Answering | Mar 17, 2024 | Event Causality Identification, Multiple-choice | —Unverified | 0 |
| Towards Collective Superintelligence: Amplifying Group IQ using Conversational Swarms | Jan 25, 2024 | Decision Making, Multiple-choice | —Unverified | 0 |
| Towards combinatorial clustering: preliminary research survey | May 28, 2015 | Clustering, Combinatorial Optimization | —Unverified | 0 |
| Enhancing LLM Evaluations: The Garbling Trick | Nov 3, 2024 | Multiple-choice | —Unverified | 0 |
| Enhancing LLMs' Reasoning-Intensive Multimedia Search Capabilities through Fine-Tuning and Reinforcement Learning | May 24, 2025 | Multiple-choice, Prompt Engineering | —Unverified | 0 |
| Enhancing Multiple-choice Machine Reading Comprehension by Punishing Illogical Interpretations | Nov 1, 2021 | Attribute, Machine Reading Comprehension | —Unverified | 0 |
| Enhancing Multiple-Choice Question Answering with Causal Knowledge | Jun 1, 2021 | Multiple-choice, Question Answering | —Unverified | 0 |
| Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation | Jan 1, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |
| EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta | Dec 31, 2024 | Multiple-choice, Question Answering | —Unverified | 0 |
| Establishing Task Scaling Laws via Compute-Efficient Model Ladders | Dec 5, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| Towards Conversational AI for Disease Management | Mar 8, 2025 | Clinical Knowledge, Diagnostic | —Unverified | 0 |
| Evalita-LLM: Benchmarking Large Language Models on Italian | Feb 4, 2025 | Benchmarking, Multiple-choice | —Unverified | 0 |
| Towards Decision Support Technology Platform for Modular Systems | Aug 23, 2014 | Clustering, Combinatorial Optimization | —Unverified | 0 |
| Evaluating LLM-corrupted Crowdsourcing Data Without Ground Truth | Jun 8, 2025 | Multiple-choice | —Unverified | 0 |
| Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis | Jan 28, 2024 | Knowledge Graphs, Medical Diagnosis | —Unverified | 0 |
| Evaluating LLMs on Document-Based QA: Exact Answer Selection and Numerical Extraction using Cogtale dataset | Nov 14, 2023 | Answer Selection, Information Retrieval | —Unverified | 0 |
| Evaluating Machine Reading Systems through Comprehension Tests | May 1, 2012 | Answer Selection, Multiple-choice | —Unverified | 0 |
| Evaluating multiple large language models in pediatric ophthalmology | Nov 7, 2023 | Multiple-choice | —Unverified | 0 |
| Evaluating Nuanced Bias in Large Language Model Free Response Answers | Jul 11, 2024 | Benchmarking, Language Modeling | —Unverified | 0 |
| Evaluating Question Answering Evaluation | Nov 1, 2019 | Answer Generation, Multiple-choice | —Unverified | 0 |
| A Corpus of Text Data and Gaze Fixations from Autistic and Non-Autistic Adults | May 1, 2016 | Multiple-choice, POS | —Unverified | 0 |
| Evaluating the Performance and Robustness of LLMs in Materials Science Q&A and Property Predictions | Sep 22, 2024 | Band Gap, In-Context Learning | —Unverified | 0 |
| Evaluating the Potential of Leading Large Language Models in Reasoning Biology Questions | Nov 5, 2023 | Logical Reasoning, Multiple-choice | —Unverified | 0 |
| Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension | Nov 30, 2023 | Multiple-choice, Reading Comprehension | —Unverified | 0 |
| Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education | Oct 18, 2023 | Multiple-choice, Multiple Choice Question Answering (MCQA) | —Unverified | 0 |
| Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms | Jun 5, 2025 | Multiple-choice | —Unverified | 0 |
| Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration | Jun 24, 2024 | Diversity, Multiple-choice | —Unverified | 0 |
| Evaluation of Automatically Generated Pronoun Reference Questions | Sep 1, 2017 | Multiple-choice, Reading Comprehension | —Unverified | 0 |
| Examining the Behavior of LLM Architectures Within the Framework of Standardized National Exams in Brazil | Aug 9, 2024 | Math, Multiple-choice | —Unverified | 0 |
| Towards Geo-Culturally Grounded LLM Generations | Feb 19, 2025 | Multiple-choice, Retrieval-augmented Generation | —Unverified | 0 |
| Towards Integrated Glance To Restructuring in Combinatorial Optimization | Dec 20, 2015 | Clustering, Combinatorial Optimization | —Unverified | 0 |
| ExplanationLP: Abductive Reasoning for Explainable Science Question Answering | Oct 25, 2020 | Answer Selection, ARC | —Unverified | 0 |
| Towards Mixed-Precision Quantization of Neural Networks via Constrained Optimization | Oct 13, 2021 | Multiple-choice, Quantization | —Unverified | 0 |
| Explore then Determine: A GNN-LLM Synergy Framework for Reasoning over Knowledge Graph | Jun 3, 2024 | Knowledge Graphs, Multiple-choice | —Unverified | 0 |
| Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement | Sep 10, 2024 | Multiple-choice, Sentence | —Unverified | 0 |
| Exploring the Capabilities of Prompted Large Language Models in Educational and Assessment Applications | May 19, 2024 | Multiple-choice | —Unverified | 0 |
| Exploring the Comprehension of ChatGPT in Traditional Chinese Medicine Knowledge | Mar 14, 2024 | Multiple-choice | —Unverified | 0 |
| How Additional Knowledge can Improve Natural Language Commonsense Question Answering? | Sep 19, 2019 | Articles, Language Modeling | —Unverified | 0 |
| Exposing the Limits of Video-Text Models through Contrast Sets | Jan 16, 2022 | Language Modeling, Language Modelling | —Unverified | 0 |
| Towards Multilingual LLM Evaluation for Baltic and Nordic languages: A study on Lithuanian History | Jan 15, 2025 | Multiple-choice, Question Answering | —Unverified | 0 |
| FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees | Nov 4, 2024 | Multiple-choice, Question Answering | —Unverified | 0 |
| Towards Multistage Design of Modular Systems | Jun 19, 2013 | Multiple-choice | —Unverified | 0 |
| FAMULUS: Interactive Annotation and Feedback Generation for Teaching Diagnostic Reasoning | Aug 29, 2019 | Diagnostic, Multiple-choice | —Unverified | 0 |
| FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models | Apr 20, 2025 | Descriptive, Ethics | —Unverified | 0 |
| Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction | Jan 28, 2025 | Logical Reasoning, Multiple-choice | —Unverified | 0 |
| FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | Mar 19, 2025 | Benchmarking, Multiple-choice | —Unverified | 0 |
| Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models | Mar 15, 2024 | Few-Shot Image Classification, image-classification | —Unverified | 0 |
| Field-testing items using artificial intelligence: Natural language processing with transformers | Oct 18, 2023 | Multiple-choice | —Unverified | 0 |