| Enhancing LLM Evaluations: The Garbling Trick | Nov 3, 2024 | Multiple-choice | Unverified | 0 |
| Benchmarking Bias in Large Language Models during Role-Playing | Nov 1, 2024 | Benchmarking, Fairness | Unverified | 0 |
| R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest | Oct 27, 2024 | Medical Visual Question Answering, Multiple-choice | Unverified | 0 |
| Improving Model Evaluation using SMART Filtering of Benchmark Datasets | Oct 26, 2024 | Chatbot, Diversity | Code Available | 3 |
| GPT-4o System Card | Oct 25, 2024 | Multiple-choice, Spatial Reasoning | Unverified | 0 |
| Delving into the Reversal Curse: How Far Can Large Language Models Generalize? | Oct 24, 2024 | Multiple-choice | Code Available | 1 |
| Beyond Multiple-Choice Accuracy: Real-World Challenges of Implementing Large Language Models in Healthcare | Oct 24, 2024 | Multiple-choice | Unverified | 0 |
| Large Language Models Still Exhibit Bias in Long Text | Oct 23, 2024 | Fairness, Multiple-choice | Unverified | 0 |
| GeoCode-GPT: A Large Language Model for Geospatial Code Generation Tasks | Oct 22, 2024 | Code Generation, Code Summarization | Unverified | 0 |
| How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making? | Oct 21, 2024 | counterfactual, Decision Making | Code Available | 0 |
| Susu Box or Piggy Bank: Assessing Cultural Commonsense Knowledge between Ghana and the U.S | Oct 21, 2024 | Multiple-choice | Unverified | 0 |
| TimeSeriesExam: A time series understanding exam | Oct 18, 2024 | Anomaly Detection, Multiple-choice | Code Available | 1 |
| Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models | Oct 18, 2024 | Fairness, Multiple-choice | Unverified | 0 |
| LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | Oct 18, 2024 | Benchmarking, Fairness | Unverified | 0 |
| LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights | Oct 17, 2024 | Legal Reasoning, Multiple-choice | Unverified | 0 |
| MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback | Oct 17, 2024 | Fact Verification, Hallucination | Code Available | 0 |
| CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy | Oct 17, 2024 | Multiple-choice, Response Generation | Unverified | 0 |
| WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | Oct 16, 2024 | Benchmarking, Fairness | Code Available | 1 |
| Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks | Oct 16, 2024 | Instruction Following, Multiple-choice | Code Available | 0 |
| Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers | Oct 15, 2024 | Multiple-choice | Code Available | 0 |
| Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs | Oct 15, 2024 | Image Description, Multiple-choice | Code Available | 0 |
| Not All Options Are Created Equal: Textual Option Weighting for Token-Efficient LLM-Based Knowledge Tracing | Oct 14, 2024 | All, Binary Classification | Unverified | 0 |
| Personalised Feedback Framework for Online Education Programmes Using Generative AI | Oct 14, 2024 | Benchmarking, Management | Unverified | 0 |
| MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | Oct 14, 2024 | Multiple-choice | Code Available | 1 |
| LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models | Oct 13, 2024 | Multiple-choice | Unverified | 0 |
| LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models | Oct 13, 2024 | Hallucination, Hallucination Evaluation | Code Available | 0 |
| Taming Overconfidence in LLMs: Reward Calibration in RLHF | Oct 13, 2024 | Multiple-choice | Code Available | 1 |
| The Future of Learning in the Age of Generative AI: Automated Question Generation and Assessment with Large Language Models | Oct 12, 2024 | Misconceptions, Multiple-choice | Unverified | 0 |
| NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models | Oct 11, 2024 | Multiple-choice, TruthfulQA | Code Available | 0 |
| SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models | Oct 11, 2024 | Few-Shot Learning, Multiple-choice | Code Available | 1 |
| Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models | Oct 10, 2024 | Conformal Prediction, Language Modeling | Unverified | 0 |
| MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models | Oct 10, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| TVBench: Redesigning Video-Language Evaluation | Oct 10, 2024 | Multiple-choice, Open-Ended Question Answering | Unverified | 0 |
| Answering Questions in Stages: Prompt Chaining for Contract QA | Oct 9, 2024 | Multiple-choice | Unverified | 0 |
| Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning | Oct 9, 2024 | Hallucination, Multiple-choice | Code Available | 0 |
| ACPBench: Reasoning about Action, Change, and Planning | Oct 8, 2024 | Multiple-choice | Unverified | 0 |
| ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition | Oct 8, 2024 | Action Recognition, Multiple-choice | Unverified | 0 |
| Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning | Oct 6, 2024 | Multiple-choice | Code Available | 0 |
| Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA | Oct 3, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| Video Instruction Tuning With Synthetic Data | Oct 3, 2024 | 3D Question Answering (3D-QA) | Unverified | 0 |
| Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales | Oct 2, 2024 | Multiple-choice | Code Available | 0 |
| MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework | Oct 2, 2024 | Benchmarking, Instruction Following | Code Available | 1 |
| DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models | Oct 2, 2024 | Multiple-choice, parameter-efficient fine-tuning | Code Available | 0 |
| Language Enhanced Model for Eye (LEME): An Open-Source Ophthalmology-Specific Large Language Model | Oct 1, 2024 | All, Language Modeling | Unverified | 0 |
| A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning | Oct 1, 2024 | Common Sense Reasoning, DeepFake Detection | Code Available | 1 |
| Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling | Sep 30, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs | Sep 30, 2024 | Benchmarking, Multiple-choice | Unverified | 0 |
| Mitigating Selection Bias with Node Pruning and Auxiliary Options | Sep 27, 2024 | Multiple-choice, Selection bias | Unverified | 0 |
| DisGeM: Distractor Generation for Multiple Choice Questions with Span Masking | Sep 26, 2024 | Distractor Generation, Multiple-choice | Code Available | 0 |
| DARE: Diverse Visual Question Answering with Robustness Evaluation | Sep 26, 2024 | image-classification, Image Classification | Unverified | 0 |