SOTAVerified

Multiple-choice

Papers

Showing 601-650 of 1107 papers

Title | Status | Hype
Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework | | 0
Fine-tuning BERT with Focus Words for Explanation Regeneration | | 0
An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models | | 0
An Automated Multiple-Choice Question Generation Using Natural Language Processing Techniques | | 0
First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge | | 0
First Token Probability Guided RAG for Telecom Question Answering | | 0
An Audio-enriched BERT-based Framework for Spoken Multiple-choice Question Answering | | 0
Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above | | 0
Training Optimus Prime, M.D.: Generating Medical Certification Items by Fine-Tuning OpenAI's gpt2 Transformer Model | | 0
ForecastQA: A Question Answering Challenge for Event Forecasting with Temporal Text Data | | 0
FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models | | 0
Framing QA as Building and Ranking Intersentence Answer Justifications | | 0
From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution, Deviation, and Future Implications in AI-Language Models | | 0
From 'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project | | 0
From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | | 0
SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing Agents | | 0
Fundamental Limitations in Defending LLM Finetuning APIs | | 0
FusionMind -- Improving question and answering with external context fusion | | 0
GANDALF: a General Character Name Description Dataset for Long Fiction | | 0
GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis | | 0
Generalised Winograd Schema and its Contextuality | | 0
Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data | | 0
Who did What: A Large-Scale Person-Centered Cloze Dataset | | 0
Generating Adequate Distractors for Multiple-Choice Questions | | 0
Generating Correct Answers for Progressive Matrices Intelligence Tests | | 0
Generating Diagnostic Multiple Choice Comprehension Cloze Questions | | 0
Who's the Best Detective? LLMs vs. MLs in Detecting Incoherent Fourth Grade Math Answers | | 0
Generating multiple-choice questions for medical question answering with distractors and cue-masking | | 0
Generating Plausible Distractors for Multiple-Choice Questions via Student Choice Prediction | | 0
Generating Questions and Multiple-Choice Answers using Semantic Analysis of Texts | | 0
GenNet : Reading Comprehension with Multiple Choice Questions using Generation and Selection model | | 0
Genome-Bench: A Scientific Reasoning Benchmark from Real-World Expert Discussions | | 0
GeoCode-GPT: A Large Language Model for Geospatial Code Generation Tasks | | 0
Good, Better, Best: Textual Distractors Generation for Multiple-Choice Visual Question Answering via Reinforcement Learning | | 0
Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark | | 0
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks | | 0
GPT-4o System Card | | 0
GPT-4 to GPT-3.5: 'Hold My Scalpel' -- A Look at the Competency of OpenAI's GPT on the Plastic Surgery In-Service Training Exam | | 0
Transliteration: A Simple Technique For Improving Multilingual Language Modeling | | 0
True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4 | | 0
GRAF: Graph Retrieval Augmented by Facts for Romanian Legal Multi-Choice Question Answering | | 0
GraphITE: Estimating Individual Effects of Graph-structured Treatments | | 0
Graph-Structured Representations for Visual Question Answering | | 0
Is There No Such Thing as a Bad Question? H4R: HalluciBot For Ratiocination, Rewriting, Ranking, and Routing | | 0
Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation | | 0
HANS, are you clever? Clever Hans Effect Analysis of Neural Systems | | 0
HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI | | 0
HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing | | 0
HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models | | 0
Have Large Language Models Developed a Personality?: Applicability of Self-Assessment Tests in Measuring Personality in LLMs | | 0

No leaderboard results yet.