SOTAVerified

Multiple-choice

Papers

Showing 601–650 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| Instruction Fine-Tuning: Does Prompt Loss Matter? | — | 0 |
| A Study on Large Language Models' Limitations in Multiple-Choice Question Answering | Code | 0 |
| Towards Efficient Methods in Medical Question Answering using Knowledge Graph Embeddings | Code | 0 |
| Assessing Large Language Models in Mechanical Engineering Education: A Study on Mechanics-Focused Conceptual Understanding | — | 0 |
| Automated Answer Validation using Text Similarity | — | 0 |
| PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities | — | 0 |
| A Novel Multi-Stage Prompting Approach for Language Agnostic MCQ Generation using GPT | Code | 0 |
| The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models | Code | 1 |
| A Joint-Reasoning based Disease Q&A System | — | 0 |
| SEED-Bench: Benchmarking Multimodal Large Language Models | Code | 3 |
| The Earth is Flat? Unveiling Factual Errors in Large Language Models | — | 0 |
| FusionMind -- Improving question and answering with external context fusion | — | 0 |
| SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security | Code | 0 |
| RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models | Code | 1 |
| HyKGE: A Hypothesis Knowledge Graph Enhanced Framework for Accurate and Reliable Medical LLMs Responses | Code | 1 |
| Towards a Unified Multimodal Reasoning Framework | Code | 0 |
| Perception Test 2023: A Summary of the First Challenge And Outcome | — | 0 |
| BloomVQA: Assessing Hierarchical Multi-modal Comprehension | — | 0 |
| Multiple Hypothesis Dropout: Estimating the Parameters of Multi-Modal Output Distributions | Code | 0 |
| An In-depth Look at Gemini's Language Abilities | Code | 1 |
| Marathon: A Race Through the Realm of Long Context with Large Language Models | Code | 1 |
| Self-Evaluation Improves Selective Generation in Large Language Models | — | 0 |
| A Foundational Multimodal Vision Language AI Assistant for Human Pathology | — | 0 |
| Steering Llama 2 via Contrastive Activation Addition | Code | 2 |
| Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers | Code | 1 |
| A Comparative Study of AI-Generated (GPT-4) and Human-crafted MCQs in Programming Education | — | 0 |
| Unleashing the Potential of Large Language Model: Zero-shot VQA for Flood Disaster Scenario | — | 0 |
| Explanatory Argument Extraction of Correct Answers in Resident Medical Exams | Code | 0 |
| Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension | — | 0 |
| Biomedical knowledge graph-optimized prompt generation for large language models | Code | 2 |
| CLOMO: Counterfactual Logical Modification with Large Language Models | Code | 0 |
| SEED-Bench-2: Benchmarking Multimodal Large Language Models | Code | 2 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Code | 2 |
| GPQA: A Graduate-Level Google-Proof Q&A Benchmark | Code | 2 |
| Downstream Trade-offs of a Family of Text Watermarks | Code | 0 |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Code | 4 |
| ConceptPsy: A Benchmark Suite with Conceptual Comprehensiveness in Psychology | — | 0 |
| Investigating Data Contamination in Modern Benchmarks for Large Language Models | — | 0 |
| Evaluating LLMs on Document-Based QA: Exact Answer Selection and Numerical Extraction using Cogtale dataset | — | 0 |
| It's Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning | Code | 0 |
| Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models | Code | 1 |
| Fake Alignment: Are LLMs Really Aligned Well? | Code | 1 |
| Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks | — | 0 |
| Assessing Distractors in Multiple-Choice Tests | — | 0 |
| Evaluating multiple large language models in pediatric ophthalmology | — | 0 |
| Evaluating the Potential of Leading Large Language Models in Reasoning Biology Questions | — | 0 |
| More Robots are Coming: Large Multimodal Models (ChatGPT) can Solve Visually Diverse Images of Parsons Problems | — | 0 |
| CASE: Commonsense-Augmented Score with an Expanded Answer Space | Code | 0 |
| Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis | Code | 1 |
| An Open Source Data Contamination Report for Large Language Models | Code | 1 |
Page 13 of 23
