| MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding | Jun 13, 2024 | Multiple-choice, Scene Understanding | Code Available | 1 |
| DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Jun 13, 2024 | Benchmarking, Hallucination | Code Available | 0 |
| OLMES: A Standard for Language Model Evaluations | Jun 12, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena | Jun 11, 2024 | Multiple-choice, Selection bias | Code Available | 2 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | Jun 11, 2024 | Multiple-choice, Question Answering | Code Available | 5 |
| BertaQA: How Much Do Language Models Know About Local Culture? | Jun 11, 2024 | Multiple-choice, Transfer Learning | Code Available | 0 |
| Towards a Personal Health Large Language Model | Jun 10, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context | Jun 10, 2024 | Decision Making, Multiple-choice | Unverified | 0 |
| Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation | Jun 8, 2024 | Abstractive Text Summarization, Dialogue Generation | Unverified | 0 |
| Do LLMs Recognize me, When I is not me: Assessment of LLMs Understanding of Turkish Indexical Pronouns in Indexical Shift Contexts | Jun 8, 2024 | Machine Translation, Multiple-choice | Unverified | 0 |
| A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding | Jun 8, 2024 | Descriptive, Language Modelling | Code Available | 1 |
| LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs | Jun 7, 2024 | Mathematical Reasoning, Multiple-choice | Code Available | 0 |
| CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models | Jun 7, 2024 | Multiple-choice, Philosophy | Code Available | 0 |
| M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering | Jun 6, 2024 | abstractive question answering, Clinical Knowledge | Code Available | 0 |
| Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive? | Jun 6, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| Every Answer Matters: Evaluating Commonsense with Probabilistic Measures | Jun 6, 2024 | Common Sense Reasoning, Language Modeling | Code Available | 0 |
| Automating Turkish Educational Quiz Generation Using Large Language Models | Jun 5, 2024 | Multiple-choice | Code Available | 0 |
| Order-Independence Without Fine Tuning | Jun 4, 2024 | Language Modelling, Multiple-choice | Code Available | 0 |
| TopViewRS: Vision-Language Models as Top-View Spatial Reasoners | Jun 4, 2024 | Multiple-choice, Spatial Reasoning | Code Available | 1 |
| Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data | Jun 4, 2024 | Clinical Knowledge, Multiple-choice | Code Available | 0 |
| Explore then Determine: A GNN-LLM Synergy Framework for Reasoning over Knowledge Graph | Jun 3, 2024 | Knowledge Graphs, Multiple-choice | Unverified | 0 |
| Strengthened Symbol Binding Makes Large Language Models Reliable Multiple-Choice Selectors | Jun 3, 2024 | Multiple-choice, Selection bias | Code Available | 0 |
| Evaluating Large Language Model Biases in Persona-Steered Generation | May 30, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Student Answer Forecasting: Transformer-Driven Answer Choice Prediction for Language Learning | May 30, 2024 | Misconceptions, Multiple-choice | Code Available | 0 |
| An Automatic Question Usability Evaluation Toolkit | May 30, 2024 | Multiple-choice, Word Embeddings | Code Available | 0 |
| Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions | May 30, 2024 | Language Modelling, Large Language Model | Code Available | 0 |
| DGRC: An Effective Fine-tuning Framework for Distractor Generation in Chinese Multi-choice Reading Comprehension | May 29, 2024 | Distractor Generation, Multiple-choice | Unverified | 0 |
| Edinburgh Clinical NLP at MEDIQA-CORR 2024: Guiding Large Language Models with Hints | May 28, 2024 | Multiple-choice, Sentence | Unverified | 0 |
| Can We Trust LLMs? Mitigate Overconfidence Bias in LLMs through Knowledge Transfer | May 27, 2024 | Multiple-choice, Sentiment Analysis | Unverified | 0 |
| iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers | May 25, 2024 | Common Sense Reasoning, Multiple-choice | Code Available | 0 |
| Eliciting Informative Text Evaluations with Large Language Models | May 23, 2024 | Multiple-choice, Prediction | Code Available | 0 |
| Imagery as Inquiry: Exploring A Multimodal Dataset for Conversational Recommendation | May 23, 2024 | Conversational Recommendation, Multiple-choice | Unverified | 0 |
| Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation | May 22, 2024 | Informativeness, Language Modeling | Code Available | 2 |
| Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning | May 22, 2024 | Mathematical Reasoning, Multiple-choice | Code Available | 1 |
| Robust portfolio optimization model for electronic coupon allocation | May 21, 2024 | Multiple-choice, Portfolio Optimization | Unverified | 0 |
| Multiple-Choice Questions are Efficient and Robust LLM Evaluators | May 20, 2024 | GSM8K, HumanEval | Code Available | 1 |
| Exploring the Capabilities of Prompted Large Language Models in Educational and Assessment Applications | May 19, 2024 | Multiple-choice | Unverified | 0 |
| From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | May 17, 2024 | Benchmarking, Multiple-choice | Unverified | 0 |
| Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset | May 17, 2024 | 16k, Benchmarking | Code Available | 3 |
| COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain | May 17, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning | May 16, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | May 14, 2024 | Form, Human-Object Interaction Detection | Unverified | 0 |
| SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation | May 14, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation | May 13, 2024 | In-Context Learning, Multiple-choice | Unverified | 0 |
| Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis | May 12, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models | May 8, 2024 | Attribute, Data Augmentation | Code Available | 1 |
| WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning | May 6, 2024 | Multiple-choice, Video Understanding | Unverified | 0 |
| Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions | May 6, 2024 | Decision Making, Multiple-choice | Code Available | 0 |
| Self-Reflection in LLM Agents: Effects on Problem-Solving Performance | May 5, 2024 | Multiple-choice | Code Available | 2 |
| Math Multiple Choice Question Generation via Human-Large Language Model Collaboration | May 1, 2024 | Language Modeling, Language Modelling | Unverified | 0 |