| LLM-as-a-Judge & Reward Model: What They Can and Cannot Do | Sep 17, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Annealed Winner-Takes-All for Motion Forecasting | Sep 17, 2024 | All, Autonomous Driving | Code Available | 1 |
| Cracking the Code: Multi-domain LLM Evaluation on Real-World Professional Exams in Indonesia | Sep 13, 2024 | Math, Multiple-choice | Unverified | 0 |
| Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement | Sep 10, 2024 | Multiple-choice, Sentence | Unverified | 0 |
| Towards Democratizing Multilingual Large Language Models For Medicine Through A Two-Stage Instruction Fine-tuning Approach | Sep 9, 2024 | Computational Efficiency, Continual Pretraining | Code Available | 0 |
| COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes | Sep 6, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| MaterialBENCH: Evaluating College-Level Materials Science Problem-Solving Abilities of Large Language Models | Sep 5, 2024 | Multiple-choice | Unverified | 0 |
| CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models | Sep 4, 2024 | GSM8K, Math | Code Available | 2 |
| Training on the Benchmark Is Not All You Need | Sep 3, 2024 | All, Multiple-choice | Code Available | 1 |
| The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines? | Sep 3, 2024 | Multiple-choice, Question Generation | Unverified | 0 |