SOTAVerified

Multiple-choice

Papers

Showing 1026-1050 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| Good, Better, Best: Textual Distractors Generation for Multiple-Choice Visual Question Answering via Reinforcement Learning | | 0 |
| Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark | | 0 |
| Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks | | 0 |
| GPT-4o System Card | | 0 |
| GPT-4 to GPT-3.5: 'Hold My Scalpel' -- A Look at the Competency of OpenAI's GPT on the Plastic Surgery In-Service Training Exam | | 0 |
| Transliteration: A Simple Technique For Improving Multilingual Language Modeling | | 0 |
| True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4 | | 0 |
| GRAF: Graph Retrieval Augmented by Facts for Romanian Legal Multi-Choice Question Answering | | 0 |
| GraphITE: Estimating Individual Effects of Graph-structured Treatments | | 0 |
| Graph-Structured Representations for Visual Question Answering | | 0 |
| Is There No Such Thing as a Bad Question? H4R: HalluciBot For Ratiocination, Rewriting, Ranking, and Routing | | 0 |
| Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation | | 0 |
| HANS, are you clever? Clever Hans Effect Analysis of Neural Systems | | 0 |
| HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI | | 0 |
| HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing | | 0 |
| HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models | | 0 |
| Have Large Language Models Developed a Personality?: Applicability of Self-Assessment Tests in Measuring Personality in LLMs | | 0 |
| Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive? | | 0 |
| Analyzing the Performance of ChatGPT in Cardiology and Vascular Pathologies | | 0 |
| Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information | | 0 |
| HFL-RC System at SemEval-2018 Task 11: Hybrid Multi-Aspects Model for Commonsense Reading Comprehension | | 0 |
| Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation | | 0 |
| HindiLLM: Large Language Model for Hindi | | 0 |
| Analyzing Multiple-Choice Reading and Listening Comprehension Tests | | 0 |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | | 0 |

No leaderboard results yet.