| The Earth is Flat? Unveiling Factual Errors in Large Language Models | Jan 1, 2024 | In-Context Learning, Multiple-choice | Unverified | 0 |
| FusionMind -- Improving question and answering with external context fusion | Dec 31, 2023 | Knowledge Graphs, Multiple-choice | Unverified | 0 |
| SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security | Dec 26, 2023 | Computer Security, Multiple-choice | Code Available | 0 |
| RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models | Dec 26, 2023 | Memorization, Multiple-choice | Code Available | 1 |
| HyKGE: A Hypothesis Knowledge Graph Enhanced Framework for Accurate and Reliable Medical LLMs Responses | Dec 26, 2023 | Diversity, Knowledge Graphs | Code Available | 1 |
| Towards a Unified Multimodal Reasoning Framework | Dec 22, 2023 | Multimodal Reasoning, Multiple-choice | Code Available | 0 |
| Perception Test 2023: A Summary of the First Challenge And Outcome | Dec 20, 2023 | Benchmarking, Grounded Video Question Answering | Unverified | 0 |
| BloomVQA: Assessing Hierarchical Multi-modal Comprehension | Dec 20, 2023 | Data Augmentation, Memorization | Unverified | 0 |
| Multiple Hypothesis Dropout: Estimating the Parameters of Multi-Modal Output Distributions | Dec 18, 2023 | Multiple-choice, Pedestrian Trajectory Prediction | Code Available | 0 |
| An In-depth Look at Gemini's Language Abilities | Dec 18, 2023 | Instruction Following, Math | Code Available | 1 |