| DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments | Mar 8, 2025 | Decision Making, Large Language Model | Code Available | 0 |
| No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding | Mar 7, 2025 | Large Language Model | Unverified | 0 |
| Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance | Mar 7, 2025 | ARC, Language Modeling | Unverified | 0 |
| SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding | Mar 7, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning | Mar 7, 2025 | Emotion Recognition, Language Modeling | Code Available | 5 |
| A Survey of Large Language Model Empowered Agents for Recommendation and Search: Towards Next-Generation Information Retrieval | Mar 7, 2025 | Information Retrieval, Language Modeling | Code Available | 2 |
| This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs | Mar 7, 2025 | Large Language Model, Multiple-choice | Code Available | 0 |
| LLM-based Iterative Approach to Metamodeling in Automotive | Mar 7, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| DETQUS: Decomposition-Enhanced Transformers for QUery-focused Summarization | Mar 7, 2025 | Decoder, Language Modeling | Unverified | 0 |
| GEMA-Score: Granular Explainable Multi-Agent Score for Radiology Report Evaluation | Mar 7, 2025 | Large Language Model, Medical Report Generation | Code Available | 0 |