Multiple-choice

Papers


Showing 301–350 of 1107 papers

Title	Date	Tasks	Status	Hype
Enhancing LLM Evaluations: The Garbling Trick	Nov 3, 2024	Multiple-choice	Unverified	0
Benchmarking Bias in Large Language Models during Role-Playing	Nov 1, 2024	Benchmarking, Fairness	Unverified	0
R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest	Oct 27, 2024	Medical Visual Question Answering, Multiple-choice	Unverified	0
Improving Model Evaluation using SMART Filtering of Benchmark Datasets	Oct 26, 2024	Chatbot, Diversity	Code Available	3
GPT-4o System Card	Oct 25, 2024	Multiple-choice, Spatial Reasoning	Unverified	0
Delving into the Reversal Curse: How Far Can Large Language Models Generalize?	Oct 24, 2024	Multiple-choice	Code Available	1
Beyond Multiple-Choice Accuracy: Real-World Challenges of Implementing Large Language Models in Healthcare	Oct 24, 2024	Multiple-choice	Unverified	0
Large Language Models Still Exhibit Bias in Long Text	Oct 23, 2024	Fairness, Multiple-choice	Unverified	0
GeoCode-GPT: A Large Language Model for Geospatial Code Generation Tasks	Oct 22, 2024	Code Generation, Code Summarization	Unverified	0
How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making?	Oct 21, 2024	counterfactual, Decision Making	Code Available	0
Susu Box or Piggy Bank: Assessing Cultural Commonsense Knowledge between Ghana and the U.S	Oct 21, 2024	Multiple-choice	Unverified	0
TimeSeriesExam: A time series understanding exam	Oct 18, 2024	Anomaly Detection, Multiple-choice	Code Available	1
Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models	Oct 18, 2024	Fairness, Multiple-choice	Unverified	0
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs	Oct 18, 2024	Benchmarking, Fairness	Unverified	0
LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights	Oct 17, 2024	Legal Reasoning, Multiple-choice	Unverified	0
MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback	Oct 17, 2024	Fact Verification, Hallucination	Code Available	0
CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy	Oct 17, 2024	Multiple-choice, Response Generation	Unverified	0
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation	Oct 16, 2024	Benchmarking, Fairness	Code Available	1
Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks	Oct 16, 2024	Instruction Following, Multiple-choice	Code Available	0
Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers	Oct 15, 2024	Multiple-choice	Code Available	0
Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs	Oct 15, 2024	Image Description, Multiple-choice	Code Available	0
Not All Options Are Created Equal: Textual Option Weighting for Token-Efficient LLM-Based Knowledge Tracing	Oct 14, 2024	All, Binary Classification	Unverified	0
Personalised Feedback Framework for Online Education Programmes Using Generative AI	Oct 14, 2024	Benchmarking, Management	Unverified	0
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models	Oct 14, 2024	Multiple-choice	Code Available	1
LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models	Oct 13, 2024	Multiple-choice	Unverified	0
LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models	Oct 13, 2024	Hallucination, Hallucination Evaluation	Code Available	0
Taming Overconfidence in LLMs: Reward Calibration in RLHF	Oct 13, 2024	Multiple-choice	Code Available	1
The Future of Learning in the Age of Generative AI: Automated Question Generation and Assessment with Large Language Models	Oct 12, 2024	Misconceptions, Multiple-choice	Unverified	0
NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models	Oct 11, 2024	Multiple-choice, TruthfulQA	Code Available	0
SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models	Oct 11, 2024	Few-Shot Learning, Multiple-choice	Code Available	1
Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models	Oct 10, 2024	Conformal Prediction, Language Modeling	Unverified	0
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models	Oct 10, 2024	Multiple-choice, Question Answering	Unverified	0
TVBench: Redesigning Video-Language Evaluation	Oct 10, 2024	Multiple-choice, Open-Ended Question Answering	Unverified	0
Answering Questions in Stages: Prompt Chaining for Contract QA	Oct 9, 2024	Multiple-choice	Unverified	0
Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning	Oct 9, 2024	Hallucination, Multiple-choice	Code Available	0
ACPBench: Reasoning about Action, Change, and Planning	Oct 8, 2024	Multiple-choice	Unverified	0
ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition	Oct 8, 2024	Action Recognition, Multiple-choice	Unverified	0
Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning	Oct 6, 2024	Multiple-choice	Code Available	0
Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA	Oct 3, 2024	Multiple-choice, Question Answering	Unverified	0
Video Instruction Tuning With Synthetic Data	Oct 3, 2024	3D Question Answering (3D-QA)	Unverified	0
Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales	Oct 2, 2024	Multiple-choice	Code Available	0
MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework	Oct 2, 2024	Benchmarking, Instruction Following	Code Available	1
DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models	Oct 2, 2024	Multiple-choice, parameter-efficient fine-tuning	Code Available	0
Language Enhanced Model for Eye (LEME): An Open-Source Ophthalmology-Specific Large Language Model	Oct 1, 2024	All, Language Modeling	Unverified	0
A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning	Oct 1, 2024	Common Sense Reasoning, DeepFake Detection	Code Available	1
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling	Sep 30, 2024	Language Modeling, Language Modelling	Unverified	0
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs	Sep 30, 2024	Benchmarking, Multiple-choice	Unverified	0
Mitigating Selection Bias with Node Pruning and Auxiliary Options	Sep 27, 2024	Multiple-choice, Selection bias	Unverified	0
DisGeM: Distractor Generation for Multiple Choice Questions with Span Masking	Sep 26, 2024	Distractor Generation, Multiple-choice	Code Available	0
DARE: Diverse Visual Question Answering with Robustness Evaluation	Sep 26, 2024	image-classification, Image Classification	Unverified	0

