| HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims | Feb 17, 2025 | Benchmarking, Fact Checking | Code Available | 1 |
| Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance | Feb 17, 2025 | Benchmarking, Dependency Parsing | Unverified | 0 |
| Knowledge-aware contrastive heterogeneous molecular graph learning | Feb 17, 2025 | Benchmarking, Contrastive Learning | Unverified | 0 |
| Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models | Feb 17, 2025 | Benchmarking | Unverified | 0 |
| Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment | Feb 17, 2025 | Benchmarking, Common Sense Reasoning | Unverified | 0 |
| Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption | Feb 17, 2025 | Benchmarking, Code Summarization | Unverified | 0 |
| Ansatz-free Hamiltonian learning with Heisenberg-limited scaling | Feb 17, 2025 | Benchmarking | Unverified | 0 |
| JExplore: Design Space Exploration Tool for Nvidia Jetson Boards | Feb 16, 2025 | Benchmarking, GPU | Code Available | 0 |
| TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking | Feb 16, 2025 | Benchmarking | Unverified | 0 |
| Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs | Feb 16, 2025 | Benchmarking | Unverified | 0 |
| Yesil o1 Pro: Evidence-Based AI Model for Health and Benchmarking in Clinical Decision Support | Feb 15, 2025 | Benchmarking, Epidemiology | Unverified | 0 |
| User Profile with Large Language Models: Construction, Updating, and Benchmarking | Feb 15, 2025 | Benchmarking, Profile Generation | Unverified | 0 |
| Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow | Feb 14, 2025 | Benchmarking | Unverified | 0 |
| LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing | Feb 14, 2025 | Benchmarking, RAG | Code Available | 0 |
| MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning? | Feb 14, 2025 | Benchmarking, In-Context Learning | Unverified | 0 |
| Benchmarking the rationality of AI decision making using the transitivity axiom | Feb 14, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Forecasting time series with constraints | Feb 14, 2025 | Additive models, Benchmarking | Code Available | 0 |
| A Survey on LLM-based News Recommender Systems | Feb 13, 2025 | Benchmarking, Fairness | Unverified | 0 |
| AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit | Feb 13, 2025 | Benchmarking, Edge-computing | Unverified | 0 |
| MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | Feb 13, 2025 | Benchmarking, Math | Unverified | 0 |
| Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs | Feb 13, 2025 | Benchmarking, Retrieval | Code Available | 1 |
| Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis | Feb 13, 2025 | Benchmarking | Unverified | 0 |
| Standardisation of Convex Ultrasound Data Through Geometric Analysis and Augmentation | Feb 13, 2025 | Benchmarking | Unverified | 0 |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | Feb 13, 2025 | Benchmarking | Unverified | 0 |
| Zero-shot generation of synthetic neurosurgical data with large language models | Feb 13, 2025 | Benchmarking, De-identification | Code Available | 0 |