| Instruction Fine-Tuning: Does Prompt Loss Matter? | Jan 24, 2024 | Multiple-choice, token-classification | Unverified | 0 |
| A Study on Large Language Models' Limitations in Multiple-Choice Question Answering | Jan 15, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| Towards Efficient Methods in Medical Question Answering using Knowledge Graph Embeddings | Jan 15, 2024 | Knowledge Graph Embeddings, Knowledge Graphs | Code Available | 0 |
| Assessing Large Language Models in Mechanical Engineering Education: A Study on Mechanics-Focused Conceptual Understanding | Jan 13, 2024 | Multiple-choice, Prompt Engineering | Unverified | 0 |
| Automated Answer Validation using Text Similarity | Jan 13, 2024 | Information Retrieval, Multiple-choice | Unverified | 0 |
| PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities | Jan 13, 2024 | Instruction Following, Multiple-choice | Unverified | 0 |
| A Novel Multi-Stage Prompting Approach for Language Agnostic MCQ Generation using GPT | Jan 13, 2024 | Distractor Generation, Multiple-choice | Code Available | 0 |
| The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models | Jan 11, 2024 | Math, Multiple-choice | Code Available | 1 |
| A Joint-Reasoning based Disease Q&A System | Jan 6, 2024 | Knowledge Graphs, Misinformation | Unverified | 0 |
| SEED-Bench: Benchmarking Multimodal Large Language Models | Jan 1, 2024 | Benchmarking, Image Generation | Code Available | 3 |
| The Earth is Flat? Unveiling Factual Errors in Large Language Models | Jan 1, 2024 | In-Context Learning, Multiple-choice | Unverified | 0 |
| FusionMind -- Improving question and answering with external context fusion | Dec 31, 2023 | Knowledge Graphs, Multiple-choice | Unverified | 0 |
| SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security | Dec 26, 2023 | Computer Security, Multiple-choice | Code Available | 0 |
| RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models | Dec 26, 2023 | Memorization, Multiple-choice | Code Available | 1 |
| HyKGE: A Hypothesis Knowledge Graph Enhanced Framework for Accurate and Reliable Medical LLMs Responses | Dec 26, 2023 | Diversity, Knowledge Graphs | Code Available | 1 |
| Towards a Unified Multimodal Reasoning Framework | Dec 22, 2023 | Multimodal Reasoning, Multiple-choice | Code Available | 0 |
| Perception Test 2023: A Summary of the First Challenge And Outcome | Dec 20, 2023 | Benchmarking, Grounded Video Question Answering | Unverified | 0 |
| BloomVQA: Assessing Hierarchical Multi-modal Comprehension | Dec 20, 2023 | Data Augmentation, Memorization | Unverified | 0 |
| Multiple Hypothesis Dropout: Estimating the Parameters of Multi-Modal Output Distributions | Dec 18, 2023 | Multiple-choice, Pedestrian Trajectory Prediction | Code Available | 0 |
| An In-depth Look at Gemini's Language Abilities | Dec 18, 2023 | Instruction Following, Math | Code Available | 1 |
| Marathon: A Race Through the Realm of Long Context with Large Language Models | Dec 15, 2023 | Long-Context Understanding, Multiple-choice | Code Available | 1 |
| Self-Evaluation Improves Selective Generation in Large Language Models | Dec 14, 2023 | Multiple-choice, TruthfulQA | Unverified | 0 |
| A Foundational Multimodal Vision Language AI Assistant for Human Pathology | Dec 13, 2023 | Decision Making, Diagnostic | Unverified | 0 |
| Steering Llama 2 via Contrastive Activation Addition | Dec 9, 2023 | Multiple-choice | Code Available | 2 |
| Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers | Dec 7, 2023 | Math, Multiple-choice | Code Available | 1 |