| LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ | Sep 25, 2024 | Chatbot, GSM8K | Unverified | 0 |
| RISCORE: Enhancing In-Context Riddle Solving in Language Models through Context-Reconstructed Example Augmentation | Sep 24, 2024 | Multiple-choice, Sentence | Unverified | 0 |
| Boosting Healthcare LLMs Through Retrieved Context | Sep 23, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation | Sep 23, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| Evaluating the Performance and Robustness of LLMs in Materials Science Q&A and Property Predictions | Sep 22, 2024 | Band Gap, In-Context Learning | Unverified | 0 |
| QMOS: Enhancing LLMs for Telecommunication with Question Masked loss and Option Shuffling | Sep 21, 2024 | Multiple-choice, Prompt Engineering | Code Available | 0 |
| First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge | Sep 20, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination | Sep 19, 2024 | General Knowledge, MMLU | Unverified | 0 |
| Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model Insights | Sep 19, 2024 | Decision Making, Knowledge Distillation | Unverified | 0 |
| Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models | Sep 19, 2024 | Ethics, Multiple-choice | Code Available | 0 |
| LLM-as-a-Judge & Reward Model: What They Can and Cannot Do | Sep 17, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Annealed Winner-Takes-All for Motion Forecasting | Sep 17, 2024 | All, Autonomous Driving | Code Available | 1 |
| Cracking the Code: Multi-domain LLM Evaluation on Real-World Professional Exams in Indonesia | Sep 13, 2024 | Math, Multiple-choice | Unverified | 0 |
| Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement | Sep 10, 2024 | Multiple-choice, Sentence | Unverified | 0 |
| Towards Democratizing Multilingual Large Language Models For Medicine Through A Two-Stage Instruction Fine-tuning Approach | Sep 9, 2024 | Computational Efficiency, Continual Pretraining | Code Available | 0 |
| COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes | Sep 6, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| MaterialBENCH: Evaluating College-Level Materials Science Problem-Solving Abilities of Large Language Models | Sep 5, 2024 | Multiple-choice | Unverified | 0 |
| CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models | Sep 4, 2024 | GSM8K, Math | Code Available | 2 |
| Training on the Benchmark Is Not All You Need | Sep 3, 2024 | All, Multiple-choice | Code Available | 1 |
| The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines? | Sep 3, 2024 | Multiple-choice, Question Generation | Unverified | 0 |
| Novel-WD: Exploring acquisition of Novel World Knowledge in LLMs Using Prefix-Tuning | Aug 30, 2024 | Causal Language Modeling, Continual Learning | Unverified | 0 |
| Wait, that's not an option: LLMs Robustness with Incorrect Multiple-Choice Options | Aug 27, 2024 | Decision Making, Multiple-choice | Code Available | 0 |
| TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering | Aug 27, 2024 | Multiple-choice, Protein Folding | Code Available | 1 |
| Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models | Aug 25, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Aug 23, 2024 | Disentanglement, Knowledge Tracing | Code Available | 1 |
| Towards Evaluating and Building Versatile Large Language Models for Medicine | Aug 22, 2024 | Multiple-choice, named-entity-recognition | Code Available | 2 |
| Large Language Models Are Self-Taught Reasoners: Enhancing LLM Applications via Tailored Problem-Solving Demonstrations | Aug 22, 2024 | Multiple-choice | Unverified | 0 |
| Differentiating Choices via Commonality for Multiple-Choice Question Answering | Aug 21, 2024 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Code Available | 0 |
| How Susceptible are LLMs to Influence in Prompts? | Aug 17, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| Measuring Agreeableness Bias in Multimodal Models | Aug 17, 2024 | Decision Making, Multiple-choice | Code Available | 0 |
| Chain-of-Exemplar: Enhancing Distractor Generation for Multimodal Educational Question Generation | Aug 16, 2024 | Distractor Generation, Multiple-choice | Code Available | 0 |
| LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs | Aug 16, 2024 | Instruction Following, Multiple-choice | Code Available | 1 |
| Examining the Behavior of LLM Architectures Within the Framework of Standardized National Exams in Brazil | Aug 9, 2024 | Math, Multiple-choice | Unverified | 0 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 |
| Winning Amazon KDD Cup'24 | Aug 5, 2024 | Data Augmentation, Multiple-choice | Unverified | 0 |
| XMainframe: A Large Language Model for Mainframe Modernization | Aug 5, 2024 | Code Summarization, Language Modeling | Code Available | 2 |
| MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models | Aug 5, 2024 | Image Comprehension, Multiple-choice | Code Available | 2 |
| Recent Advances in Multi-Choice Machine Reading Comprehension: A Survey on Methods and Datasets | Aug 4, 2024 | Few-Shot Learning, Machine Reading Comprehension | Unverified | 0 |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | Aug 3, 2024 | Hallucination, Multiple-choice | Code Available | 12 |
| MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models | Aug 2, 2024 | Multimodal Reasoning, Multiple-choice | Code Available | 3 |
| Improved Few-Shot Image Classification Through Multiple-Choice Questions | Jul 23, 2024 | Articles, Few-Shot Image Classification | Unverified | 0 |
| Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models | Jul 23, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity | Jul 22, 2024 | Diversity, Multiple-choice | Code Available | 2 |
| Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing | Jul 22, 2024 | All, Diversity | Code Available | 1 |
| LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding | Jul 22, 2024 | Multiple-choice, Question Answering | Code Available | 2 |
| Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions | Jul 21, 2024 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Unverified | 0 |
| MIBench: Evaluating Multimodal Large Language Models over Multiple Images | Jul 21, 2024 | In-Context Learning, Multiple-choice | Unverified | 0 |
| Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment | Jul 20, 2024 | Contrastive Learning, Multiple-choice | Code Available | 0 |
| Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data | Jul 20, 2024 | Language Modelling, Machine Translation | Unverified | 0 |
| Evaluating language models as risk scores | Jul 19, 2024 | Multiple-choice, Question Answering | Code Available | 1 |