| UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces | Mar 8, 2025 | Benchmarking, counterfactual | Unverified | 0 |
| Towards Conversational AI for Disease Management | Mar 8, 2025 | Clinical Knowledge, Diagnostic | Unverified | 0 |
| SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios | Mar 8, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset | Mar 8, 2025 | Multiple-choice | Code Available | 1 |
| Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework | Mar 7, 2025 | Conformal Prediction, Medical Question Answering | Unverified | 0 |
| This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs | Mar 7, 2025 | Large Language Model, Multiple-choice | Code Available | 0 |
| The impact of AI and peer feedback on research writing skills: a study using the CGScholar platform among Kazakhstani scholars | Mar 5, 2025 | Multiple-choice, Survey | Unverified | 0 |
| Analogical Reasoning Inside Large Language Models: Concept Vectors and the Limits of Abstraction | Mar 5, 2025 | In-Context Learning, Multiple-choice | Code Available | 0 |
| Structured Outputs Enable General-Purpose LLMs to be Medical Experts | Mar 5, 2025 | Clinical Knowledge, Medical Question Answering | Unverified | 0 |
| None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering | Mar 3, 2025 | Business Ethics, Ethics | Unverified | 0 |