| FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Feb 20, 2025 | Age Estimation, Benchmarking | Code Available | 2 |
| PredictaBoard: Benchmarking LLM Score Predictability | Feb 20, 2025 | Benchmarking, Common Sense Reasoning | Code Available | 0 |
| Synthetic Porous Microstructures: Automatic Design, Simulation, and Permeability Analysis | Feb 20, 2025 | Benchmarking | Code Available | 0 |
| Position: There are no Champions in Long-Term Time Series Forecasting | Feb 19, 2025 | Benchmarking, Position | —Unverified | 0 |
| GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking | Feb 19, 2025 | Benchmarking | —Unverified | 0 |
| Benchmarking of Different YOLO Models for CAPTCHAs Detection and Classification | Feb 19, 2025 | Benchmarking | —Unverified | 0 |
| A Baseline Method for Removing Invisible Image Watermarks using Deep Image Prior | Feb 19, 2025 | Benchmarking, Misinformation | —Unverified | 0 |
| Benchmarking Self-Supervised Learning Methods for Accelerated MRI Reconstruction | Feb 19, 2025 | Benchmarking, MRI Reconstruction | Code Available | 0 |
| Benchmarking LLMs for Political Science: A United Nations Perspective | Feb 19, 2025 | Benchmarking, Decision Making | Code Available | 1 |
| VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare | Feb 19, 2025 | Benchmarking, Diversity | —Unverified | 0 |
| Multilingual European Language Models: Benchmarking Approaches and Challenges | Feb 18, 2025 | Benchmarking, Question Answering | —Unverified | 0 |
| STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models | Feb 18, 2025 | Benchmarking, Large Language Model | —Unverified | 0 |
| A deep learning framework for efficient pathology image analysis | Feb 18, 2025 | Benchmarking, Deep Learning | Code Available | 4 |
| Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics | Feb 18, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 |
| Text2World: Benchmarking Large Language Models for Symbolic World Model Generation | Feb 18, 2025 | Benchmarking | —Unverified | 0 |
| LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation | Feb 18, 2025 | Benchmarking, Text Generation | —Unverified | 0 |
| EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking | Feb 18, 2025 | Benchmarking, Binary Classification | —Unverified | 0 |
| Reinforcement Learning for Dynamic Resource Allocation in Optical Networks: Hype or Hope? | Feb 18, 2025 | Benchmarking, Blocking | Code Available | 1 |
| Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis | Feb 18, 2025 | Benchmarking, Mamba | Code Available | 0 |
| A new pathway to generative artificial intelligence by minimizing the maximum entropy | Feb 18, 2025 | Benchmarking | —Unverified | 0 |
| Benchmarking MedMNIST dataset on real quantum hardware | Feb 18, 2025 | Benchmarking, image-classification | —Unverified | 0 |
| Positional Encoding in Transformer-Based Time Series Models: A Survey | Feb 17, 2025 | Anomaly Detection, Benchmarking | Code Available | 1 |
| Integrating Expert Knowledge into Logical Programs via LLMs | Feb 17, 2025 | Benchmarking, Logical Reasoning | Code Available | 0 |
| Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics | Feb 17, 2025 | Benchmarking, Diagnostic | —Unverified | 0 |
| ILIAS: Instance-Level Image retrieval At Scale | Feb 17, 2025 | Benchmarking, Image Retrieval | Code Available | 1 |
| HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims | Feb 17, 2025 | Benchmarking, Fact Checking | Code Available | 1 |
| Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance | Feb 17, 2025 | Benchmarking, Dependency Parsing | —Unverified | 0 |
| Knowledge-aware contrastive heterogeneous molecular graph learning | Feb 17, 2025 | Benchmarking, Contrastive Learning | —Unverified | 0 |
| Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models | Feb 17, 2025 | Benchmarking | —Unverified | 0 |
| Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment | Feb 17, 2025 | Benchmarking, Common Sense Reasoning | —Unverified | 0 |
| Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption | Feb 17, 2025 | Benchmarking, Code Summarization | —Unverified | 0 |
| Ansatz-free Hamiltonian learning with Heisenberg-limited scaling | Feb 17, 2025 | Benchmarking | —Unverified | 0 |
| JExplore: Design Space Exploration Tool for Nvidia Jetson Boards | Feb 16, 2025 | Benchmarking, GPU | Code Available | 0 |
| TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking | Feb 16, 2025 | Benchmarking | —Unverified | 0 |
| Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs | Feb 16, 2025 | Benchmarking | —Unverified | 0 |
| Yesil o1 Pro: Evidence-Based AI Model for Health and Benchmarking in Clinical Decision Support | Feb 15, 2025 | Benchmarking, Epidemiology | —Unverified | 0 |
| User Profile with Large Language Models: Construction, Updating, and Benchmarking | Feb 15, 2025 | Benchmarking, Profile Generation | —Unverified | 0 |
| Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow | Feb 14, 2025 | Benchmarking | —Unverified | 0 |
| LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing | Feb 14, 2025 | Benchmarking, RAG | Code Available | 0 |
| MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning? | Feb 14, 2025 | Benchmarking, In-Context Learning | —Unverified | 0 |
| Benchmarking the rationality of AI decision making using the transitivity axiom | Feb 14, 2025 | Benchmarking, Decision Making | —Unverified | 0 |
| Forecasting time series with constraints | Feb 14, 2025 | Additive models, Benchmarking | Code Available | 0 |
| A Survey on LLM-based News Recommender Systems | Feb 13, 2025 | Benchmarking, Fairness | —Unverified | 0 |
| AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit | Feb 13, 2025 | Benchmarking, Edge-computing | —Unverified | 0 |
| MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | Feb 13, 2025 | Benchmarking, Math | —Unverified | 0 |
| Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs | Feb 13, 2025 | Benchmarking, Retrieval | Code Available | 1 |
| Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis | Feb 13, 2025 | Benchmarking | —Unverified | 0 |
| Standardisation of Convex Ultrasound Data Through Geometric Analysis and Augmentation | Feb 13, 2025 | Benchmarking | —Unverified | 0 |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | Feb 13, 2025 | Benchmarking | —Unverified | 0 |
| Zero-shot generation of synthetic neurosurgical data with large language models | Feb 13, 2025 | Benchmarking, De-identification | Code Available | 0 |