| AILS-NTUA at SemEval-2024 Task 9: Cracking Brain Teasers: Transformer Models for Lateral Thinking Puzzles | Apr 1, 2024 | Common Sense Reasoning, Multiple-choice | Code Available | 0 |
| DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors | May 29, 2025 | MMLU, Multiple-choice | Code Available | 0 |
| EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants | Feb 27, 2025 | Multiple-choice | Code Available | 0 |
| MMM: Multi-stage Multi-task Learning for Multi-choice Reading Comprehension | Oct 1, 2019 | Logical Reasoning, Machine Reading Comprehension | Code Available | 0 |
| Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions | May 6, 2024 | Decision Making, Multiple-choice | Code Available | 0 |
| MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models | Dec 10, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| Pragmatic Competence Evaluation of Large Language Models for the Korean Language | Mar 19, 2024 | Few-Shot Learning, Multiple-choice | Code Available | 0 |
| Which is the Effective Way for Gaokao: Information Retrieval or Neural Networks? | Apr 1, 2017 | Information Retrieval, Multiple-choice | Code Available | 0 |
| Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models | Sep 19, 2024 | Ethics, Multiple-choice | Code Available | 0 |
| Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning | Feb 8, 2025 | Legal Reasoning, Multiple-choice | Code Available | 0 |
| Precise Task Formalization Matters in Winograd Schema Evaluations | Oct 8, 2020 | Language Modeling, Language Modelling | Code Available | 0 |
| Towards a Unified Multimodal Reasoning Framework | Dec 22, 2023 | Multimodal Reasoning, Multiple-choice | Code Available | 0 |
| IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models | Jun 18, 2024 | Management, Multiple-choice | Code Available | 0 |
| iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers | May 25, 2024 | Common Sense Reasoning, Multiple-choice | Code Available | 0 |
| Eliciting Informative Text Evaluations with Large Language Models | May 23, 2024 | Multiple-choice, Prediction | Code Available | 0 |
| ElimiNet: A Model for Eliminating Options for Reading Comprehension with Multiple Choice Questions | Apr 4, 2019 | Multiple-choice, Reading Comprehension | Code Available | 0 |
| Self-Recognition in Language Models | Jul 9, 2024 | Multiple-choice | Code Available | 0 |
| EMBRACE: Evaluation and Modifications for Boosting RACE | May 15, 2023 | Machine Reading Comprehension, Multiple-choice | Code Available | 0 |
| Can multiple-choice questions really be useful in detecting the abilities of LLMs? | Mar 26, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment | Jul 20, 2024 | Contrastive Learning, Multiple-choice | Code Available | 0 |
| Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy | May 24, 2023 | In-Context Learning, Multiple-choice | Code Available | 0 |
| Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? | Jul 2, 2024 | Graph Mining, Language Modeling | Code Available | 0 |
| Iterative Forward Tuning Boosts In-Context Learning in Language Models | May 22, 2023 | Decision Making, In-Context Learning | Code Available | 0 |
| Can We Guide a Multi-Hop Reasoning Language Model to Incrementally Learn at Each Single-Hop? | Oct 1, 2022 | Language Modeling, Language Modelling | Code Available | 0 |
| BnMMLU: Measuring Massive Multitask Language Understanding in Bengali | May 25, 2025 | General Knowledge, MMLU | Code Available | 0 |
| It's Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning | Nov 13, 2023 | Multiple-choice | Code Available | 0 |
| Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension | Apr 21, 2019 | Data Augmentation, Language Modelling | Code Available | 0 |
| Joint Learning of Sentence Embeddings for Relevance and Entailment | May 16, 2016 | Decision Making, Information Retrieval | Code Available | 0 |
| Enhancing textual textbook question answering with large language models and retrieval augmented generation | Feb 5, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation | Apr 9, 2025 | Multiple-choice | Code Available | 0 |
| KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models | Oct 15, 2023 | Multiple-choice, Triplet | Code Available | 0 |
| AutoCast++: Enhancing World Event Prediction with Zero-shot Ranking-based Context Retrieval | Oct 3, 2023 | Articles, Decision Making | Code Available | 0 |
| Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare | Feb 22, 2025 | Decision Making, Multiple-choice | Code Available | 0 |
| Uncertainty quantification in fine-tuned LLMs using LoRA ensembles | Feb 19, 2024 | Multiple-choice, Uncertainty Quantification | Code Available | 0 |
| Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings | Dec 9, 2024 | Multiple-choice | Code Available | 0 |
| Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam | Jun 14, 2024 | Fairness, Logical Reasoning | Code Available | 0 |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | May 13, 2025 | Form, Multiple-choice | Code Available | 0 |
| Towards Democratizing Multilingual Large Language Models For Medicine Through A Two-Stage Instruction Fine-tuning Approach | Sep 9, 2024 | Computational Efficiency, Continual Pretraining | Code Available | 0 |
| Evaluating Large Language Model Biases in Persona-Steered Generation | May 30, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| SeqSAM: Autoregressive Multiple Hypothesis Prediction for Medical Image Segmentation using SAM | Mar 12, 2025 | Image Segmentation, Medical Image Segmentation | Code Available | 0 |
| SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios | Mar 8, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering | Jun 6, 2024 | abstractive question answering, Clinical Knowledge | Code Available | 0 |
| Order-Independence Without Fine Tuning | Jun 4, 2024 | Language Modelling, Multiple-choice | Code Available | 0 |
| Towards Diverse Perspective Learning with Selection over Multiple Temporal Poolings | Mar 14, 2024 | Multiple-choice, Time Series | Code Available | 0 |
| PROST: Physical Reasoning of Objects through Space and Time | Jun 7, 2021 | Multiple-choice | Code Available | 0 |
| VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence | Apr 3, 2025 | Multiple-choice | Code Available | 0 |
| Evaluating Prompts Across Multiple Choice Tasks In a Zero-Shot Setting | Mar 29, 2022 | Multiple-choice | Code Available | 0 |
| This Land is Your, My Land: Evaluating Geopolitical Biases in Language Models | May 24, 2023 | Language Modelling, Large Language Model | Code Available | 0 |
| Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks | Oct 16, 2024 | Instruction Following, Multiple-choice | Code Available | 0 |
| Multi-class Hierarchical Question Classification for Multiple Choice Science Exams | Aug 15, 2019 | Classification, General Classification | Code Available | 0 |