SOTAVerified

Multiple-choice

Papers

Showing 301–325 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| Explore then Determine: A GNN-LLM Synergy Framework for Reasoning over Knowledge Graph | | 0 |
| Exposing the Limits of Video-Text Models through Contrast Sets | | 0 |
| Cleared for Takeoff? Compositional & Conditional Reasoning may be the Achilles Heel to (Flight-Booking) Language Agents | | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | | 0 |
| ARGUS: Hallucination and Omission Evaluation in Video-LLMs | | 0 |
| Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data | | 0 |
| Are You Doubtful? Oh, It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation | | 0 |
| AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning | | 0 |
| A review of faithfulness metrics for hallucination assessment in Large Language Models | | 0 |
| Adaptive Wizard for Removing Cross-Tier Misconfigurations in Active Directory | | 0 |
| Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks | | 0 |
| Changing Answer Order Can Decrease MMLU Accuracy | | 0 |
| Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education | | 0 |
| Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms | | 0 |
| Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation | | 0 |
| CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding | | 0 |
| Adaptive Crowdsourcing Algorithms for the Bandit Survey Problem | | 0 |
| CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models | | 0 |
| Evaluating the Potential of Leading Large Language Models in Reasoning Biology Questions | | 0 |
| CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy | | 0 |
| A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options | | 0 |
| AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models | | 0 |
| AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic | | 0 |
| Recent Advances in Multi-Choice Machine Reading Comprehension: A Survey on Methods and Datasets | | 0 |
| Evaluating the Performance and Robustness of LLMs in Materials Science Q&A and Property Predictions | | 0 |

No leaderboard results yet.