| HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing | Dec 13, 2024 | GPUMultiple-choice | —Unverified | 0 |
| A multimodal dataset for understanding the impact of mobile phones on remote online virtual education | Dec 13, 2024 | EEGHead Pose Estimation | CodeCode Available | 0 |
| LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering | Dec 13, 2024 | Few-Shot LearningKnowledge Distillation | —Unverified | 0 |
| Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT | Dec 13, 2024 | Multiple-choice | CodeCode Available | 0 |
| Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion | Dec 12, 2024 | HallucinationKnowledge Graph Completion | CodeCode Available | 1 |
| Neptune: The Long Orbit to Benchmarking Long Video Understanding | Dec 12, 2024 | BenchmarkingMultimodal Reasoning | CodeCode Available | 2 |
| MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models | Dec 10, 2024 | Multiple-choiceQuestion Answering | CodeCode Available | 0 |
| Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings | Dec 9, 2024 | Multiple-choice | CodeCode Available | 0 |
| ACQ: A Unified Framework for Automated Programmatic Creativity in Online Advertising | Dec 9, 2024 | Multiple-choiceMulti-Task Learning | —Unverified | 0 |
| Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor | Dec 8, 2024 | MisconceptionsMultiple-choice | CodeCode Available | 0 |
| MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection Dataset for Tiny Objects | Dec 6, 2024 | 2kAnomaly Detection | —Unverified | 0 |
| GRAF: Graph Retrieval Augmented by Facts for Romanian Legal Multi-Choice Question Answering | Dec 5, 2024 | Information RetrievalMultiple-choice | —Unverified | 0 |
| Establishing Task Scaling Laws via Compute-Efficient Model Ladders | Dec 5, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? | Dec 3, 2024 | Multiple-choice | CodeCode Available | 1 |
| SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages | Dec 2, 2024 | Multiple-choice | CodeCode Available | 1 |
| The use of large language models to enhance cancer clinical trial educational materials | Dec 2, 2024 | MisinformationMultiple-choice | —Unverified | 0 |
| Unlocking Video-LLM via Agent-of-Thoughts Distillation | Dec 2, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models | Dec 2, 2024 | MMLUMultiple-choice | CodeCode Available | 0 |
| Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages | Dec 1, 2024 | ARCMultiple-choice | —Unverified | 0 |
| KnowledgePrompts: Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting | Dec 1, 2024 | Multiple-choiceMultiple Choice Question Answering (MCQA) | CodeCode Available | 0 |
| VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information | Dec 1, 2024 | Multiple-choice | CodeCode Available | 1 |
| Cognitive Biases in Large Language Models: A Survey and Mitigation Experiments | Nov 30, 2024 | Multiple-choice | —Unverified | 0 |
| Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | Nov 29, 2024 | BenchmarkingGrounded Video Question Answering | —Unverified | 0 |
| Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments | Nov 28, 2024 | Multiple-choice | —Unverified | 0 |
| Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers | Nov 28, 2024 | Image Captioningimage-classification | —Unverified | 0 |
| Multiple Choice Learning for Efficient Speech Separation with Many Speakers | Nov 27, 2024 | Multiple-choiceSpeech Separation | —Unverified | 0 |
| CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Nov 27, 2024 | BenchmarkingEarth Observation | CodeCode Available | 1 |
| NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects? | Nov 26, 2024 | AttributeMultiple-choice | —Unverified | 0 |
| GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis | Nov 25, 2024 | Medical Visual Question AnsweringMultiple-choice | —Unverified | 0 |
| All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages | Nov 25, 2024 | AllLong Question Answer | CodeCode Available | 1 |
| SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text | Nov 25, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset | Nov 23, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation | Nov 20, 2024 | ChatbotMultiple-choice | —Unverified | 0 |
| Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning | Nov 18, 2024 | Logical ReasoningMultiple-choice | —Unverified | 0 |
| VidComposition: Can MLLMs Analyze Compositions in Compiled Videos? | Nov 17, 2024 | Multiple-choice | CodeCode Available | 1 |
| A Benchmark for Long-Form Medical Question Answering | Nov 14, 2024 | Answer GenerationForm | CodeCode Available | 0 |
| DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine | Nov 14, 2024 | FormHallucination | CodeCode Available | 0 |
| TRACE: Transformer-based Risk Assessment for Clinical Evaluation | Nov 13, 2024 | Decision MakingMissing Values | CodeCode Available | 0 |
| IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark for LLMs | Nov 12, 2024 | coreference-resolutionCoreference Resolution | CodeCode Available | 0 |
| SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing Agents | Nov 12, 2024 | General KnowledgeHallucination | —Unverified | 0 |
| StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification | Nov 11, 2024 | Large Language ModelMultimodal Large Language Model | CodeCode Available | 2 |
| Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability | Nov 10, 2024 | Multiple-choiceText Generation | —Unverified | 0 |
| Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators | Nov 8, 2024 | Decision MakingMultiple-choice | —Unverified | 0 |
| Quantitative Assessment of Intersectional Empathetic Bias and Understanding | Nov 8, 2024 | Multiple-choice | CodeCode Available | 0 |
| ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding | Nov 7, 2024 | BenchmarkingMultiple-choice | —Unverified | 0 |
| HourVideo: 1-Hour Video-Language Understanding | Nov 7, 2024 | Benchmarkingcounterfactual | CodeCode Available | 2 |
| MEG: Medical Knowledge-Augmented Large Language Models for Question Answering | Nov 6, 2024 | Knowledge Graph EmbeddingsMultiple-choice | CodeCode Available | 1 |
| MILU: A Multi-task Indic Language Understanding Benchmark | Nov 4, 2024 | Multiple-choiceQuestion Answering | CodeCode Available | 1 |
| FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees | Nov 4, 2024 | Multiple-choiceQuestion Answering | —Unverified | 0 |
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | Nov 4, 2024 | Caption GenerationMultiple-choice | CodeCode Available | 2 |