Multiple-choice

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 401–450 of 1107 papers

Title	Date	Tasks	Status
LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning	Feb 16, 2025	Analogical questionsIn-Context Learning	—Unverified
VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models	Feb 14, 2025	Image CaptioningLarge Language Model	—Unverified
Objective quantification of mood states using large language models	Feb 13, 2025	Multiple-choice	—Unverified
Truth Knows No Language: Evaluating Truthfulness Beyond English	Feb 13, 2025	InformativenessMachine Translation	CodeCode Available
A Semantic Parsing Algorithm to Solve Linear Ordering Problems	Feb 12, 2025	Multiple-choiceSemantic Parsing	—Unverified
SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models	Feb 12, 2025	FairnessMultiple-choice	—Unverified
Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs	Feb 12, 2025	Multiple-choiceSurvey	—Unverified
PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian	Feb 11, 2025	Multiple-choice	—Unverified
Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark	Feb 10, 2025	MMLUMorphological Analysis	—Unverified
HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models	Feb 9, 2025	Answer GenerationLanguage Modeling	CodeCode Available
Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning	Feb 8, 2025	Legal ReasoningMultiple-choice	CodeCode Available
ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning	Feb 7, 2025	Multiple-choiceQuestion Answering	CodeCode Available
LLMs to Support a Domain Specific Knowledge Assistant	Feb 6, 2025	ChatbotMultiple-choice	—Unverified
The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs	Feb 6, 2025	Multiple-choiceSensitivity	—Unverified
Evalita-LLM: Benchmarking Large Language Models on Italian	Feb 4, 2025	BenchmarkingMultiple-choice	—Unverified
The Use of Artificial Intelligence Tools in Assessing Content Validity: A Comparative Study with Human Experts	Feb 3, 2025	Multiple-choiceReading Comprehension	—Unverified
CoddLLM: Empowering Large Language Models for Data Analytics	Feb 1, 2025	Multiple-choiceSynthetic Data Generation	—Unverified
InnerThoughts: Disentangling Representations and Predictions in Large Language Models	Jan 29, 2025	Multiple-choicePosition	—Unverified
Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection	Jan 28, 2025	Multiple-choice	—Unverified
Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction	Jan 28, 2025	Logical ReasoningMultiple-choice	—Unverified
Attribution analysis of legal language as used by LLM	Jan 28, 2025	Binary ClassificationMultiple-choice	—Unverified
Options-Aware Dense Retrieval for Multiple-Choice query Answering	Jan 27, 2025	Multiple-choiceQuestion Answering	—Unverified
HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI	Jan 26, 2025	MMLUMultiple-choice	—Unverified
Option-ID Based Elimination For Multiple Choice Questions	Jan 25, 2025	Multiple-choice	CodeCode Available
LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion	Jan 25, 2025	Multiple-choiceReading Comprehension	—Unverified
LLM Evaluation Based on Aerospace Manufacturing Expertise: Automated Generation and Multi-Model Question Answering	Jan 25, 2025	Information RetrievalMultiple-choice	—Unverified
Humanity's Last Exam	Jan 24, 2025	Humanity's Last ExamLanguage Modeling	—Unverified
On the Reasoning Capacity of AI Models and How to Quantify It	Jan 23, 2025	MemorizationMMLU	—Unverified
Auto-Evaluation: A Critical Measure in Driving Improvements in Quality and Safety of AI-Generated Lesson Resources	Jan 23, 2025	Multiple-choice	—Unverified
Patent Figure Classification using Large Vision-language Models	Jan 22, 2025	ClassificationFew-Shot Learning	CodeCode Available
The AI Penalization Effect: People Reduce Compensation for Workers Who Use AI	Jan 22, 2025	Multiple-choice	—Unverified
Generating Plausible Distractors for Multiple-Choice Questions via Student Choice Prediction	Jan 21, 2025	Distractor GenerationMisconceptions	—Unverified
Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No!	Jan 18, 2025	Multiple-choiceQuestion Answering	—Unverified
Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework	Jan 16, 2025	Multiple-choiceQuestion Generation	—Unverified
Vision-Language Models Do Not Understand Negation	Jan 16, 2025	Multiple-choiceNegation	—Unverified
Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong	Jan 16, 2025	Multiple-choice	—Unverified
Towards Multilingual LLM Evaluation for Baltic and Nordic languages: A study on Lithuanian History	Jan 15, 2025	Multiple-choiceQuestion Answering	—Unverified
Rethinking AI Cultural Alignment	Jan 13, 2025	Multiple-choice	—Unverified
Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation	Jan 12, 2025	AttributeMultiple-choice	—Unverified
First Token Probability Guided RAG for Telecom Question Answering	Jan 11, 2025	Multiple-choiceMultiple Choice Question Answering (MCQA)	—Unverified
Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding	Jan 10, 2025	Automatic Speech RecognitionClassification	CodeCode Available
Affordably Fine-tuned LLMs Provide Better Answers to Course-specific MCQs	Jan 10, 2025	Multiple-choice	CodeCode Available
Knowledge Retrieval Based on Generative AI	Jan 8, 2025	Large Language ModelMultiple-choice	—Unverified
DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests	Jan 8, 2025	Multimodal ReasoningMultiple-choice	—Unverified
Localizing AI: Evaluating Open-Weight Language Models for Languages of Baltic States	Jan 7, 2025	Machine TranslationMultiple-choice	—Unverified
(WhyPHI) Fine-Tuning PHI-3 for Multiple-Choice Question Answering: Methodology, Results, and Challenges	Jan 3, 2025	Multiple-choiceQuestion Answering	CodeCode Available
CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering	Jan 2, 2025	Multiple-choiceQuestion Answering	—Unverified
FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding	Jan 1, 2025	Action RecognitionMultiple-choice	CodeCode Available
Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation	Jan 1, 2025	Language ModelingLanguage Modelling	—Unverified
Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs	Jan 1, 2025	Multiple-choiceVideo Generation	—Unverified

Show:10 25 50

← PrevPage 9 of 23Next →

No leaderboard results yet.