Multiple-choice

Papers

Showing 51–100 of 1107 papers

Title	Date	Tasks	Status	Hype
Understanding Long Videos with Multimodal Language Models	Mar 25, 2024	Action Recognition, Fine-grained Action Recognition	Code Available	2
ToMBench: Benchmarking Theory of Mind in Large Language Models	Feb 23, 2024	Benchmarking, Multiple-choice	Code Available	2
tinyBenchmarks: evaluating LLMs with fewer examples	Feb 22, 2024	MMLU, Multiple-choice	Code Available	2
CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge	Feb 12, 2024	General Knowledge, Multiple-choice	Code Available	2
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models	Feb 7, 2024	Diversity, Multiple-choice	Code Available	2
Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models	Jan 27, 2024	Medical Question Answering, Multiple-choice	Code Available	2
Steering Llama 2 via Contrastive Activation Addition	Dec 9, 2023	Multiple-choice	Code Available	2
Biomedical knowledge graph-optimized prompt generation for large language models	Nov 29, 2023	Benchmarking, Knowledge Graphs	Code Available	2
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	Nov 28, 2023	3D Question Answering (3D-QA), Diagnostic	Code Available	2
SEED-Bench-2: Benchmarking Multimodal Large Language Models	Nov 28, 2023	Benchmarking, Image Generation	Code Available	2
GPQA: A Graduate-Level Google-Proof Q&A Benchmark	Nov 20, 2023	Multiple-choice	Code Available	2
SafetyBench: Evaluating the Safety of Large Language Models	Sep 13, 2023	Multiple-choice	Code Available	2
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants	Aug 31, 2023	Belebele, Cross-Lingual Transfer	Code Available	2
FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models	Aug 19, 2023	Multiple-choice	Code Available	2
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	Jul 31, 2023	Multiple-choice, Question Answering	Code Available	2
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension	Jul 30, 2023	Benchmarking, Multiple-choice	Code Available	2
MQAG: Multiple-choice Question Answering and Generation for Assessing Information Consistency in Summarization	Jan 28, 2023	Hallucination, Multiple-choice	Code Available	2
Perception Test: A Diagnostic Benchmark for Multimodal Models	Oct 19, 2022	Diagnostic, Multiple-choice	Code Available	2
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering	Sep 20, 2022	Multimodal Deep Learning, Multimodal Reasoning	Code Available	2
MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering	Mar 27, 2022	Diversity, Multiple-choice	Code Available	2
All in One: Exploring Unified Video-Language Pre-training	Mar 14, 2022	All, Language Modelling	Code Available	2
What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams	Sep 28, 2020	MedQA, Multiple-choice	Code Available	2
STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving	Jun 6, 2025	Autonomous Driving, Autonomous Vehicles	Code Available	1
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation	Jun 4, 2025	Multiple-choice	Code Available	1
Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean	Jun 2, 2025	Multiple-choice	Code Available	1
Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities	May 23, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	1
Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework	May 22, 2025	Multiple-choice, Visual Question Answering (VQA)	Code Available	1
LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?	May 18, 2025	Logical Reasoning, Multimodal Reasoning	Code Available	1
GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing	May 16, 2025	Instruction Following, Multiple-choice	Code Available	1
Ranked Voting based Self-Consistency of Large Language Models	May 16, 2025	Multiple-choice, Open-Ended Question Answering	Code Available	1
IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation	May 16, 2025	Multiple-choice	Code Available	1
Benchmarking AI scientists in omics data-driven biological research	May 13, 2025	Benchmarking, Multiple-choice	Code Available	1
Assessing the Chemical Intelligence of Large Language Models	May 12, 2025	Multiple-choice	Code Available	1
ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering	Apr 7, 2025	Chart Question Answering, Chart Understanding	Code Available	1
Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark	Mar 26, 2025	MMLU, Multiple-choice	Code Available	1
Language Model Uncertainty Quantification with Attention Chain	Mar 24, 2025	Computational Efficiency, Language Modeling	Code Available	1
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models	Mar 20, 2025	Multiple-choice, Video Understanding	Code Available	1
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research	Mar 17, 2025	Articles, Benchmarking	Code Available	1
CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset	Mar 8, 2025	Multiple-choice	Code Available	1
AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models	Feb 24, 2025	Logical Reasoning, Multiple-choice	Code Available	1
Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models	Feb 16, 2025	Multiple-choice	Code Available	1
TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes	Feb 4, 2025	Autonomous Driving, Multiple-choice	Code Available	1
FaceXBench: Evaluating Multimodal LLMs on Face Understanding	Jan 17, 2025	Fairness, Multiple-choice	Code Available	1
ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind	Jan 15, 2025	Benchmarking, Multiple-choice	Code Available	1
ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian	Jan 12, 2025	Benchmarking, Math	Code Available	1
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation	Jan 6, 2025	Language Model Evaluation, Language Modeling	Code Available	1
Unifying Specialized Visual Encoders for Video Language Models	Jan 2, 2025	Multiple-choice, Video Understanding	Code Available	1
Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion	Dec 12, 2024	Hallucination, Knowledge Graph Completion	Code Available	1
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?	Dec 3, 2024	Multiple-choice	Code Available	1
SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages	Dec 2, 2024	Multiple-choice	Code Available	1