| QRMeM: Unleash the Length Limitation through Question then Reflection Memory Mechanism | Jun 19, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration | Jun 19, 2024 | Benchmarking, Distractor Generation | Unverified | 0 |
| On the Principles behind Opinion Dynamics in Multi-Agent Systems of Large Language Models | Jun 18, 2024 | Multiple-choice | Unverified | 0 |
| Aqulia-Med LLM: Pioneering Full-Process Open-Source Medical Language Models | Jun 18, 2024 | Multiple-choice | Unverified | 0 |
| QOG:Question and Options Generation based on Language Model | Jun 18, 2024 | Information Retrieval, Language Modeling | Unverified | 0 |
| UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions | Jun 18, 2024 | Benchmarking, Multiple-choice | Code Available | 0 |
| DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence? | Jun 18, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models | Jun 18, 2024 | Management, Multiple-choice | Code Available | 0 |
| Grade Score: Quantifying LLM Performance in Option Selection | Jun 17, 2024 | Decision Making, Fairness | Code Available | 0 |
| Balancing Rigor and Utility: Mitigating Cognitive Biases in Large Language Models for Multiple-Choice Questions | Jun 16, 2024 | Decision Making, Language Modelling | Code Available | 0 |
| VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | Jun 16, 2024 | Action Understanding, Benchmarking | Unverified | 0 |
| VCEval: Rethinking What is a Good Educational Video and How to Automatically Evaluate It | Jun 15, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam | Jun 14, 2024 | Fairness, Logical Reasoning | Code Available | 0 |
| DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Jun 13, 2024 | Benchmarking, Hallucination | Code Available | 0 |
| Bayesian Statistical Modeling with Predictors from LLMs | Jun 13, 2024 | Multiple-choice | Unverified | 0 |
| AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models | Jun 13, 2024 | Multiple-choice | Unverified | 0 |
| OLMES: A Standard for Language Model Evaluations | Jun 12, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| BertaQA: How Much Do Language Models Know About Local Culture? | Jun 11, 2024 | Multiple-choice, Transfer Learning | Code Available | 0 |
| Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context | Jun 10, 2024 | Decision Making, Multiple-choice | Unverified | 0 |
| Towards a Personal Health Large Language Model | Jun 10, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation | Jun 8, 2024 | Abstractive Text Summarization, Dialogue Generation | Unverified | 0 |
| Do LLMs Recognize me, When I is not me: Assessment of LLMs Understanding of Turkish Indexical Pronouns in Indexical Shift Contexts | Jun 8, 2024 | Machine Translation, Multiple-choice | Unverified | 0 |
| CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models | Jun 7, 2024 | Multiple-choice, Philosophy | Code Available | 0 |
| LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs | Jun 7, 2024 | Mathematical Reasoning, Multiple-choice | Code Available | 0 |
| Every Answer Matters: Evaluating Commonsense with Probabilistic Measures | Jun 6, 2024 | Common Sense Reasoning, Language Modeling | Code Available | 0 |
| M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering | Jun 6, 2024 | abstractive question answering, Clinical Knowledge | Code Available | 0 |
| Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive? | Jun 6, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| Automating Turkish Educational Quiz Generation Using Large Language Models | Jun 5, 2024 | Multiple-choice | Code Available | 0 |
| Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data | Jun 4, 2024 | Clinical Knowledge, Multiple-choice | Code Available | 0 |
| Order-Independence Without Fine Tuning | Jun 4, 2024 | Language Modelling, Multiple-choice | Code Available | 0 |
| Strengthened Symbol Binding Makes Large Language Models Reliable Multiple-Choice Selectors | Jun 3, 2024 | Multiple-choice, Selection bias | Code Available | 0 |
| Explore then Determine: A GNN-LLM Synergy Framework for Reasoning over Knowledge Graph | Jun 3, 2024 | Knowledge Graphs, Multiple-choice | Unverified | 0 |
| Student Answer Forecasting: Transformer-Driven Answer Choice Prediction for Language Learning | May 30, 2024 | Misconceptions, Multiple-choice | Code Available | 0 |
| An Automatic Question Usability Evaluation Toolkit | May 30, 2024 | Multiple-choice, Word Embeddings | Code Available | 0 |
| Evaluating Large Language Model Biases in Persona-Steered Generation | May 30, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions | May 30, 2024 | Language Modelling, Large Language Model | Code Available | 0 |
| DGRC: An Effective Fine-tuning Framework for Distractor Generation in Chinese Multi-choice Reading Comprehension | May 29, 2024 | Distractor Generation, Multiple-choice | Unverified | 0 |
| Edinburgh Clinical NLP at MEDIQA-CORR 2024: Guiding Large Language Models with Hints | May 28, 2024 | Multiple-choice, Sentence | Unverified | 0 |
| Can We Trust LLMs? Mitigate Overconfidence Bias in LLMs through Knowledge Transfer | May 27, 2024 | Multiple-choice, Sentiment Analysis | Unverified | 0 |
| iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers | May 25, 2024 | Common Sense Reasoning, Multiple-choice | Code Available | 0 |
| Eliciting Informative Text Evaluations with Large Language Models | May 23, 2024 | Multiple-choice, Prediction | Code Available | 0 |
| Imagery as Inquiry: Exploring A Multimodal Dataset for Conversational Recommendation | May 23, 2024 | Conversational Recommendation, Multiple-choice | Unverified | 0 |
| Robust portfolio optimization model for electronic coupon allocation | May 21, 2024 | Multiple-choice, Portfolio Optimization | Unverified | 0 |
| Exploring the Capabilities of Prompted Large Language Models in Educational and Assessment Applications | May 19, 2024 | Multiple-choice | Unverified | 0 |
| COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain | May 17, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | May 17, 2024 | Benchmarking, Multiple-choice | Unverified | 0 |
| AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning | May 16, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | May 14, 2024 | Form, Human-Object Interaction Detection | Unverified | 0 |
| MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation | May 13, 2024 | In-Context Learning, Multiple-choice | Unverified | 0 |
| Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis | May 12, 2024 | Multiple-choice, Question Answering | Code Available | 0 |