| Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models | Feb 16, 2025 | Multiple-choice | Code Available | 1 |
| LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning | Feb 16, 2025 | Analogical questions, In-Context Learning | Unverified | 0 |
| VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models | Feb 14, 2025 | Image Captioning, Large Language Model | Unverified | 0 |
| Objective quantification of mood states using large language models | Feb 13, 2025 | Multiple-choice | Unverified | 0 |
| Truth Knows No Language: Evaluating Truthfulness Beyond English | Feb 13, 2025 | Informativeness, Machine Translation | Code Available | 0 |
| SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models | Feb 12, 2025 | Fairness, Multiple-choice | Unverified | 0 |
| A Semantic Parsing Algorithm to Solve Linear Ordering Problems | Feb 12, 2025 | Multiple-choice, Semantic Parsing | Unverified | 0 |
| Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs | Feb 12, 2025 | Multiple-choice, Survey | Unverified | 0 |
| PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian | Feb 11, 2025 | Multiple-choice | Unverified | 0 |
| Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark | Feb 10, 2025 | MMLU, Morphological Analysis | Unverified | 0 |