Multiple-choice

Papers


Showing 276–300 of 1107 papers

Title	Date	Tasks	Status	Hype
Multiple Choice Learning for Efficient Speech Separation with Many Speakers	Nov 27, 2024	Multiple-choice, Speech Separation	Unverified	0
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models	Nov 27, 2024	Benchmarking, Earth Observation	Code Available	1
NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?	Nov 26, 2024	Attribute, Multiple-choice	Unverified	0
GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis	Nov 25, 2024	Medical Visual Question Answering, Multiple-choice	Unverified	0
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages	Nov 25, 2024	All, Long Question Answer	Code Available	1
SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text	Nov 25, 2024	Language Modeling, Language Modelling	Unverified	0
AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset	Nov 23, 2024	Language Modeling, Language Modelling	Unverified	0
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation	Nov 20, 2024	Chatbot, Multiple-choice	Unverified	0
Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning	Nov 18, 2024	Logical Reasoning, Multiple-choice	Unverified	0
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?	Nov 17, 2024	Multiple-choice	Code Available	1
A Benchmark for Long-Form Medical Question Answering	Nov 14, 2024	Answer Generation, Form	Code Available	0
DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine	Nov 14, 2024	Form, Hallucination	Code Available	0
TRACE: Transformer-based Risk Assessment for Clinical Evaluation	Nov 13, 2024	Decision Making, Missing Values	Code Available	0
IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark for LLMs	Nov 12, 2024	coreference-resolution, Coreference Resolution	Code Available	0
SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing Agents	Nov 12, 2024	General Knowledge, Hallucination	Unverified	0
StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification	Nov 11, 2024	Large Language Model, Multimodal Large Language Model	Code Available	2
Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability	Nov 10, 2024	Multiple-choice, Text Generation	Unverified	0
Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators	Nov 8, 2024	Decision Making, Multiple-choice	Unverified	0
Quantitative Assessment of Intersectional Empathetic Bias and Understanding	Nov 8, 2024	Multiple-choice	Code Available	0
ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding	Nov 7, 2024	Benchmarking, Multiple-choice	Unverified	0
HourVideo: 1-Hour Video-Language Understanding	Nov 7, 2024	Benchmarking, counterfactual	Code Available	2
MEG: Medical Knowledge-Augmented Large Language Models for Question Answering	Nov 6, 2024	Knowledge Graph Embeddings, Multiple-choice	Code Available	1
MILU: A Multi-task Indic Language Understanding Benchmark	Nov 4, 2024	Multiple-choice, Question Answering	Code Available	1
FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees	Nov 4, 2024	Multiple-choice, Question Answering	Unverified	0
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance	Nov 4, 2024	Caption Generation, Multiple-choice	Code Available	2


