SOTAVerified

Multiple-choice

Papers

Showing 301-350 of 1107 papers

Title | Status | Hype
First Token Probability Guided RAG for Telecom Question Answering | | 0
From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | | 0
Cleared for Takeoff? Compositional & Conditional Reasoning may be the Achilles Heel to (Flight-Booking) Language Agents | | 0
CinePile: A Long Video Question Answering Dataset and Benchmark | | 0
ARGUS: Hallucination and Omission Evaluation in Video-LLMs | | 0
Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data | | 0
Are You Doubtful? Oh, It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation | | 0
AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning | | 0
A review of faithfulness metrics for hallucination assessment in Large Language Models | | 0
Adaptive Wizard for Removing Cross-Tier Misconfigurations in Active Directory | | 0
Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks | | 0
Changing Answer Order Can Decrease MMLU Accuracy | | 0
Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework | | 0
Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation | | 0
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding | | 0
Adaptive Crowdsourcing Algorithms for the Bandit Survey Problem | | 0
CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models | | 0
Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models | | 0
CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy | | 0
A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options | | 0
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models | | 0
AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic | | 0
Recent Advances in Multi-Choice Machine Reading Comprehension: A Survey on Methods and Datasets | | 0
Field-testing items using artificial intelligence: Natural language processing with transformers | | 0
Fine-tuning BERT with Focus Words for Explanation Regeneration | | 0
Can We Trust LLMs? Mitigate Overconfidence Bias in LLMs through Knowledge Transfer | | 0
AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects | | 0
FAMULUS: Interactive Annotation and Feedback Generation for Teaching Diagnostic Reasoning | | 0
Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! | | 0
GeoSQA: A Benchmark for Scenario-based Question Answering in the Geography Domain at High School Level | | 0
FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models | | 0
Aquila-Med LLM: Pioneering Full-Process Open-Source Medical Language Models | | 0
FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees | | 0
Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher Education Programming Courses? | | 0
Can Crowdsourcing be used for Effective Annotation of Arabic? | | 0
Can ChatGPT pass the Vietnamese National High School Graduation Examination? | | 0
Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments | | 0
Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension | | 0
Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | | 0
A Joint-Reasoning based Disease Q&A System | | 0
How Additional Knowledge can Improve Natural Language Commonsense Question Answering? | | 0
Analysis of the Cambridge Multiple-Choice Questions Reading Dataset with a Focus on Candidate Response Distribution | | 0
Answer Uncertainty and Unanswerability in Multiple-Choice Machine Reading Comprehension | | 0
Answer Uncertainty and Unanswerability in Multiple-Choice Machine Reading Comprehension | | 0
Bridging the Language Gap: Knowledge Injected Multilingual Question Answering | | 0
Bridging Information-Seeking Human Gaze and Machine Reading Comprehension | | 0
Adapting Vision-Language Models for Evaluating World Models | | 0
Exposing the Limits of Video-Text Models through Contrast Sets | | 0
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | | 0
SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing Agents | | 0
Page 7 of 23

No leaderboard results yet.