| A Comparative Study of AI-Generated (GPT-4) and Human-crafted MCQs in Programming Education | Dec 5, 2023 | Multiple-choice | —Unverified | 0 |
| Unleashing the Potential of Large Language Model: Zero-shot VQA for Flood Disaster Scenario | Dec 4, 2023 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Explanatory Argument Extraction of Correct Answers in Resident Medical Exams | Dec 1, 2023 | Multiple-choice | CodeCode Available | 0 |
| Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension | Nov 30, 2023 | Multiple-choiceReading Comprehension | —Unverified | 0 |
| Biomedical knowledge graph-optimized prompt generation for large language models | Nov 29, 2023 | BenchmarkingKnowledge Graphs | CodeCode Available | 2 |
| CLOMO: Counterfactual Logical Modification with Large Language Models | Nov 29, 2023 | counterfactualCounterfactual Reasoning | CodeCode Available | 0 |
| SEED-Bench-2: Benchmarking Multimodal Large Language Models | Nov 28, 2023 | BenchmarkingImage Generation | CodeCode Available | 2 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Nov 28, 2023 | 3D Question Answering (3D-QA)Diagnostic | CodeCode Available | 2 |
| GPQA: A Graduate-Level Google-Proof Q&A Benchmark | Nov 20, 2023 | Multiple-choice | CodeCode Available | 2 |
| Downstream Trade-offs of a Family of Text Watermarks | Nov 16, 2023 | FormLanguage Modelling | CodeCode Available | 0 |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Nov 16, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 4 |
| ConceptPsy:A Benchmark Suite with Conceptual Comprehensiveness in Psychology | Nov 16, 2023 | MMLUMultiple-choice | —Unverified | 0 |
| Investigating Data Contamination in Modern Benchmarks for Large Language Models | Nov 16, 2023 | Common Sense ReasoningMMLU | —Unverified | 0 |
| Evaluating LLMs on Document-Based QA: Exact Answer Selection and Numerical Extraction using Cogtale dataset | Nov 14, 2023 | Answer SelectionInformation Retrieval | —Unverified | 0 |
| It's Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning | Nov 13, 2023 | Multiple-choice | CodeCode Available | 0 |
| Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models | Nov 10, 2023 | GSM8KMemorization | CodeCode Available | 1 |
| Fake Alignment: Are LLMs Really Aligned Well? | Nov 10, 2023 | Multiple-choice | CodeCode Available | 1 |
| Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks | Nov 9, 2023 | Multiple-choiceWorld Knowledge | —Unverified | 0 |
| Assessing Distractors in Multiple-Choice Tests | Nov 8, 2023 | DiversityMultiple-choice | —Unverified | 0 |
| Evaluating multiple large language models in pediatric ophthalmology | Nov 7, 2023 | Multiple-choice | —Unverified | 0 |
| Evaluating the Potential of Leading Large Language Models in Reasoning Biology Questions | Nov 5, 2023 | Logical ReasoningMultiple-choice | —Unverified | 0 |
| More Robots are Coming: Large Multimodal Models (ChatGPT) can Solve Visually Diverse Images of Parsons Problems | Nov 3, 2023 | Multiple-choice | —Unverified | 0 |
| CASE: Commonsense-Augmented Score with an Expanded Answer Space | Nov 3, 2023 | Multiple-choice | CodeCode Available | 0 |
| Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis | Nov 2, 2023 | Density EstimationDiversity | CodeCode Available | 1 |
| An Open Source Data Contamination Report for Large Language Models | Oct 26, 2023 | HellaSwagLanguage Modeling | CodeCode Available | 1 |