| DeSIQ: Towards an Unbiased, Challenging Benchmark for Social Intelligence Understanding | Oct 24, 2023 | Language Modeling, Language Modelling | —Unverified | 0 |
| POE: Process of Elimination for Multiple Choice Reasoning | Oct 24, 2023 | In-Context Learning, Logical Reasoning | Code Available | 0 |
| Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond | Oct 23, 2023 | counterfactual, Multiple-choice | —Unverified | 0 |
| StoryAnalogy: Deriving Story-level Analogies from Large Language Models to Unlock Analogical Understanding | Oct 19, 2023 | Multiple-choice, Natural Language Understanding | Code Available | 0 |
| Field-testing items using artificial intelligence: Natural language processing with transformers | Oct 18, 2023 | Multiple-choice | —Unverified | 0 |
| Investigating Uncertainty Calibration of Aligned Language Models under the Multiple-Choice Setting | Oct 18, 2023 | Multiple-choice | —Unverified | 0 |
| Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education | Oct 18, 2023 | Multiple-choice, Multiple Choice Question Answering (MCQA) | —Unverified | 0 |
| JMedLoRA: Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning | Oct 16, 2023 | Domain Adaptation, Medical Question Answering | Code Available | 1 |
| KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models | Oct 15, 2023 | Multiple-choice, Triplet | Code Available | 0 |
| Mitigating Bias for Question Answering Models by Tracking Bias Influence | Oct 13, 2023 | Multiple-choice, Multi-Task Learning | —Unverified | 0 |
| OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models | Oct 11, 2023 | Hallucination, In-Context Learning | Code Available | 1 |
| BRAINTEASER: Lateral Thinking Puzzles for Large Language Models | Oct 8, 2023 | Distractor Generation, Language Modelling | Code Available | 1 |
| Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks | Oct 7, 2023 | Action Recognition, Multiple-choice | —Unverified | 0 |
| LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models | Oct 5, 2023 | Common Sense Reasoning, Multiple-choice | Code Available | 1 |
| On the Performance of Multimodal Language Models | Oct 4, 2023 | Benchmarking, Binary Classification | —Unverified | 0 |
| AutoCast++: Enhancing World Event Prediction with Zero-shot Ranking-based Context Retrieval | Oct 3, 2023 | Articles, Decision Making | Code Available | 0 |
| Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions | Oct 3, 2023 | Misconceptions, Multiple-choice | Code Available | 0 |
| Language Models as Knowledge Bases for Visual Word Sense Disambiguation | Oct 3, 2023 | Image Captioning, Multiple-choice | Code Available | 0 |
| Fusing Models with Complementary Expertise | Oct 2, 2023 | Multiple-choice, text-classification | Code Available | 0 |
| Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | Oct 2, 2023 | In-Context Learning, Instruction Following | Code Available | 1 |
| Automating question generation from educational text | Sep 26, 2023 | Multiple-choice, Question Generation | —Unverified | 0 |
| HANS, are you clever? Clever Hans Effect Analysis of Neural Systems | Sep 21, 2023 | Decision Making, Multiple-choice | —Unverified | 0 |
| Exploring Iterative Enhancement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models | Sep 19, 2023 | Explanation Generation, Language Modelling | Code Available | 0 |
| Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Sep 19, 2023 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| Benchmarks for Pirá 2.0, a Reading Comprehension Dataset about the Ocean, the Brazilian Coast, and Climate Change | Sep 19, 2023 | Generative Question Answering, Information Retrieval | —Unverified | 0 |