Multiple-choice

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 251–300 of 1107 papers

Title	Date	Tasks	Status	Hype
HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing	Dec 13, 2024	GPUMultiple-choice	—Unverified	0
A multimodal dataset for understanding the impact of mobile phones on remote online virtual education	Dec 13, 2024	EEGHead Pose Estimation	CodeCode Available	0
LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering	Dec 13, 2024	Few-Shot LearningKnowledge Distillation	—Unverified	0
Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT	Dec 13, 2024	Multiple-choice	CodeCode Available	0
Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion	Dec 12, 2024	HallucinationKnowledge Graph Completion	CodeCode Available	1
Neptune: The Long Orbit to Benchmarking Long Video Understanding	Dec 12, 2024	BenchmarkingMultimodal Reasoning	CodeCode Available	2
MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models	Dec 10, 2024	Multiple-choiceQuestion Answering	CodeCode Available	0
Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings	Dec 9, 2024	Multiple-choice	CodeCode Available	0
ACQ: A Unified Framework for Automated Programmatic Creativity in Online Advertising	Dec 9, 2024	Multiple-choiceMulti-Task Learning	—Unverified	0
Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor	Dec 8, 2024	MisconceptionsMultiple-choice	CodeCode Available	0
MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection Dataset for Tiny Objects	Dec 6, 2024	2kAnomaly Detection	—Unverified	0
GRAF: Graph Retrieval Augmented by Facts for Romanian Legal Multi-Choice Question Answering	Dec 5, 2024	Information RetrievalMultiple-choice	—Unverified	0
Establishing Task Scaling Laws via Compute-Efficient Model Ladders	Dec 5, 2024	Language ModelingLanguage Modelling	—Unverified	0
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?	Dec 3, 2024	Multiple-choice	CodeCode Available	1
SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages	Dec 2, 2024	Multiple-choice	CodeCode Available	1
The use of large language models to enhance cancer clinical trial educational materials	Dec 2, 2024	MisinformationMultiple-choice	—Unverified	0
Unlocking Video-LLM via Agent-of-Thoughts Distillation	Dec 2, 2024	Language ModelingLanguage Modelling	—Unverified	0
Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models	Dec 2, 2024	MMLUMultiple-choice	CodeCode Available	0
Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages	Dec 1, 2024	ARCMultiple-choice	—Unverified	0
KnowledgePrompts: Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting	Dec 1, 2024	Multiple-choiceMultiple Choice Question Answering (MCQA)	CodeCode Available	0
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information	Dec 1, 2024	Multiple-choice	CodeCode Available	1
Cognitive Biases in Large Language Models: A Survey and Mitigation Experiments	Nov 30, 2024	Multiple-choice	—Unverified	0
Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark	Nov 29, 2024	BenchmarkingGrounded Video Question Answering	—Unverified	0
Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments	Nov 28, 2024	Multiple-choice	—Unverified	0
Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers	Nov 28, 2024	Image Captioningimage-classification	—Unverified	0
Multiple Choice Learning for Efficient Speech Separation with Many Speakers	Nov 27, 2024	Multiple-choiceSpeech Separation	—Unverified	0
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models	Nov 27, 2024	BenchmarkingEarth Observation	CodeCode Available	1
NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?	Nov 26, 2024	AttributeMultiple-choice	—Unverified	0
GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis	Nov 25, 2024	Medical Visual Question AnsweringMultiple-choice	—Unverified	0
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages	Nov 25, 2024	AllLong Question Answer	CodeCode Available	1
SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text	Nov 25, 2024	Language ModelingLanguage Modelling	—Unverified	0
AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset	Nov 23, 2024	Language ModelingLanguage Modelling	—Unverified	0
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation	Nov 20, 2024	ChatbotMultiple-choice	—Unverified	0
Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning	Nov 18, 2024	Logical ReasoningMultiple-choice	—Unverified	0
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?	Nov 17, 2024	Multiple-choice	CodeCode Available	1
A Benchmark for Long-Form Medical Question Answering	Nov 14, 2024	Answer GenerationForm	CodeCode Available	0
DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine	Nov 14, 2024	FormHallucination	CodeCode Available	0
TRACE: Transformer-based Risk Assessment for Clinical Evaluation	Nov 13, 2024	Decision MakingMissing Values	CodeCode Available	0
IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark for LLMs	Nov 12, 2024	coreference-resolutionCoreference Resolution	CodeCode Available	0
SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing Agents	Nov 12, 2024	General KnowledgeHallucination	—Unverified	0
StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification	Nov 11, 2024	Large Language ModelMultimodal Large Language Model	CodeCode Available	2
Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability	Nov 10, 2024	Multiple-choiceText Generation	—Unverified	0
Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators	Nov 8, 2024	Decision MakingMultiple-choice	—Unverified	0
Quantitative Assessment of Intersectional Empathetic Bias and Understanding	Nov 8, 2024	Multiple-choice	CodeCode Available	0
ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding	Nov 7, 2024	BenchmarkingMultiple-choice	—Unverified	0
HourVideo: 1-Hour Video-Language Understanding	Nov 7, 2024	Benchmarkingcounterfactual	CodeCode Available	2
MEG: Medical Knowledge-Augmented Large Language Models for Question Answering	Nov 6, 2024	Knowledge Graph EmbeddingsMultiple-choice	CodeCode Available	1
MILU: A Multi-task Indic Language Understanding Benchmark	Nov 4, 2024	Multiple-choiceQuestion Answering	CodeCode Available	1
FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees	Nov 4, 2024	Multiple-choiceQuestion Answering	—Unverified	0
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance	Nov 4, 2024	Caption GenerationMultiple-choice	CodeCode Available	2

Show:10 25 50

← PrevPage 6 of 23Next →

No leaderboard results yet.