SOTAVerified

Multiple-choice

Papers

Showing 401–450 of 1107 papers

Title | Status | Hype
LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning | | 0
VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models | | 0
Objective quantification of mood states using large language models | | 0
Truth Knows No Language: Evaluating Truthfulness Beyond English | Code | 0
A Semantic Parsing Algorithm to Solve Linear Ordering Problems | | 0
SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models | | 0
Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs | | 0
PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian | | 0
Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark | | 0
HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models | Code | 0
Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning | Code | 0
ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning | Code | 0
LLMs to Support a Domain Specific Knowledge Assistant | | 0
The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs | | 0
Evalita-LLM: Benchmarking Large Language Models on Italian | | 0
The Use of Artificial Intelligence Tools in Assessing Content Validity: A Comparative Study with Human Experts | | 0
CoddLLM: Empowering Large Language Models for Data Analytics | | 0
InnerThoughts: Disentangling Representations and Predictions in Large Language Models | | 0
Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection | | 0
Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction | | 0
Attribution analysis of legal language as used by LLM | | 0
Options-Aware Dense Retrieval for Multiple-Choice query Answering | | 0
HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI | | 0
Option-ID Based Elimination For Multiple Choice Questions | Code | 0
LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion | | 0
LLM Evaluation Based on Aerospace Manufacturing Expertise: Automated Generation and Multi-Model Question Answering | | 0
Humanity's Last Exam | | 0
On the Reasoning Capacity of AI Models and How to Quantify It | | 0
Auto-Evaluation: A Critical Measure in Driving Improvements in Quality and Safety of AI-Generated Lesson Resources | | 0
Patent Figure Classification using Large Vision-language Models | Code | 0
The AI Penalization Effect: People Reduce Compensation for Workers Who Use AI | | 0
Generating Plausible Distractors for Multiple-Choice Questions via Student Choice Prediction | | 0
Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! | | 0
Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework | | 0
Vision-Language Models Do Not Understand Negation | | 0
Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong | | 0
Towards Multilingual LLM Evaluation for Baltic and Nordic languages: A study on Lithuanian History | | 0
Rethinking AI Cultural Alignment | | 0
Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation | | 0
First Token Probability Guided RAG for Telecom Question Answering | | 0
Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding | Code | 0
Affordably Fine-tuned LLMs Provide Better Answers to Course-specific MCQs | Code | 0
Knowledge Retrieval Based on Generative AI | | 0
DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests | | 0
Localizing AI: Evaluating Open-Weight Language Models for Languages of Baltic States | | 0
(WhyPHI) Fine-Tuning PHI-3 for Multiple-Choice Question Answering: Methodology, Results, and Challenges | Code | 0
CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering | | 0
FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding | Code | 0
Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation | | 0
Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs | | 0
Page 9 of 23

No leaderboard results yet.