| Taming Overconfidence in LLMs: Reward Calibration in RLHF | Oct 13, 2024 | Multiple-choice | Code Available | 1 |
| SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models | Oct 11, 2024 | Few-Shot Learning, Multiple-choice | Code Available | 1 |
| MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework | Oct 2, 2024 | Benchmarking, Instruction Following | Code Available | 1 |
| A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning | Oct 1, 2024 | Common Sense Reasoning, DeepFake Detection | Code Available | 1 |
| Boosting Healthcare LLMs Through Retrieved Context | Sep 23, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| Annealed Winner-Takes-All for Motion Forecasting | Sep 17, 2024 | All, Autonomous Driving | Code Available | 1 |
| Training on the Benchmark Is Not All You Need | Sep 3, 2024 | All, Multiple-choice | Code Available | 1 |
| TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering | Aug 27, 2024 | Multiple-choice, Protein Folding | Code Available | 1 |
| Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Aug 23, 2024 | Disentanglement, Knowledge Tracing | Code Available | 1 |
| LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs | Aug 16, 2024 | Instruction Following, Multiple-choice | Code Available | 1 |