Multiple-choice

Papers


Showing 101–150 of 1107 papers

Title	Date	Tasks	Status	Hype
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information	Dec 1, 2024	Multiple-choice	Code Available	1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models	Nov 27, 2024	Benchmarking, Earth Observation	Code Available	1
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages	Nov 25, 2024	All, Long Question Answer	Code Available	1
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?	Nov 17, 2024	Multiple-choice	Code Available	1
MEG: Medical Knowledge-Augmented Large Language Models for Question Answering	Nov 6, 2024	Knowledge Graph Embeddings, Multiple-choice	Code Available	1
MILU: A Multi-task Indic Language Understanding Benchmark	Nov 4, 2024	Multiple-choice, Question Answering	Code Available	1
Delving into the Reversal Curse: How Far Can Large Language Models Generalize?	Oct 24, 2024	Multiple-choice	Code Available	1
TimeSeriesExam: A time series understanding exam	Oct 18, 2024	Anomaly Detection, Multiple-choice	Code Available	1
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation	Oct 16, 2024	Benchmarking, Fairness	Code Available	1
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models	Oct 14, 2024	Multiple-choice	Code Available	1
Taming Overconfidence in LLMs: Reward Calibration in RLHF	Oct 13, 2024	Multiple-choice	Code Available	1
SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models	Oct 11, 2024	Few-Shot Learning, Multiple-choice	Code Available	1
MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework	Oct 2, 2024	Benchmarking, Instruction Following	Code Available	1
A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning	Oct 1, 2024	Common Sense Reasoning, DeepFake Detection	Code Available	1
Boosting Healthcare LLMs Through Retrieved Context	Sep 23, 2024	Benchmarking, Multiple-choice	Code Available	1
Annealed Winner-Takes-All for Motion Forecasting	Sep 17, 2024	All, Autonomous Driving	Code Available	1
Training on the Benchmark Is Not All You Need	Sep 3, 2024	All, Multiple-choice	Code Available	1
TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering	Aug 27, 2024	Multiple-choice, Protein Folding	Code Available	1
Enhancing Knowledge Tracing with Concept Map and Response Disentanglement	Aug 23, 2024	Disentanglement, Knowledge Tracing	Code Available	1
LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs	Aug 16, 2024	Instruction Following, Multiple-choice	Code Available	1
Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing	Jul 22, 2024	All, Diversity	Code Available	1
Evaluating language models as risk scores	Jul 19, 2024	Multiple-choice, Question Answering	Code Available	1
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish	Jul 17, 2024	Math, Multiple-choice	Code Available	1
Fine-tuning Multimodal Large Language Models for Product Bundling	Jul 16, 2024	In-Context Learning, Multiple-choice	Code Available	1
Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models	Jul 15, 2024	Backdoor Attack, Multiple-choice	Code Available	1
ORAN-Bench-13K: An Open Source Benchmark for Assessing LLMs in Open Radio Access Networks	Jul 8, 2024	Anomaly Detection, Code Generation	Code Available	1
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts	Jul 6, 2024	Logical Reasoning, Mathematical Reasoning	Code Available	1
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation	Jun 29, 2024	Multiple-choice	Code Available	1
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding	Jun 28, 2024	Multiple-choice, Video Understanding	Code Available	1
HCQA @ Ego4D EgoSchema Challenge 2024	Jun 22, 2024	Caption Generation	Code Available	1
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification	Jun 20, 2024	Benchmarking, Classification	Code Available	1
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture	Jun 16, 2024	Diversity, Multiple-choice	Code Available	1
CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training	Jun 15, 2024	Domain Adaptation, Language Modeling	Code Available	1
IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce	Jun 14, 2024	Multiple-choice, Question Answering	Code Available	1
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages	Jun 14, 2024	Multiple-choice	Code Available	1
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance	Jun 13, 2024	Multiple-choice, Visual Reasoning	Code Available	1
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding	Jun 13, 2024	Multiple-choice, Scene Understanding	Code Available	1
A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding	Jun 8, 2024	Descriptive, Language Modelling	Code Available	1
TopViewRS: Vision-Language Models as Top-View Spatial Reasoners	Jun 4, 2024	Multiple-choice, Spatial Reasoning	Code Available	1
Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning	May 22, 2024	Mathematical Reasoning, Multiple-choice	Code Available	1
Multiple-Choice Questions are Efficient and Robust LLM Evaluators	May 20, 2024	GSM8K, HumanEval	Code Available	1
SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation	May 14, 2024	Benchmarking, Multiple-choice	Code Available	1
THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models	May 8, 2024	Attribute, Data Augmentation	Code Available	1
Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom	Apr 30, 2024	Implicatures, Multiple-choice	Code Available	1
Latxa: An Open Language Model and Evaluation Suite for Basque	Mar 29, 2024	Language Modeling, Language Modelling	Code Available	1
Non-Linear Inference Time Intervention: Improving LLM Truthfulness	Mar 27, 2024	Large Language Model, Multiple-choice	Code Available	1
IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models	Mar 23, 2024	Common Sense Reasoning, In-Context Learning	Code Available	1
Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs	Mar 12, 2024	Knowledge Graphs, Multiple-choice	Code Available	1
Unfamiliar Finetuning Examples Control How Language Models Hallucinate	Mar 8, 2024	MMLU, Multiple-choice	Code Available	1
To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering	Mar 4, 2024	MedQA, MMLU	Code Available	1

