| Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries | Feb 23, 2025 | Benchmarking, Image Retrieval | Code Available | 0 |
| Benchmarking Online Object Trackers for Underwater Robot Position Locking Applications | Feb 23, 2025 | Benchmarking, Object Tracking | Unverified | 0 |
| On Neural Inertial Classification Networks for Pedestrian Activity Recognition | Feb 23, 2025 | Activity Recognition, Benchmarking | Unverified | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | Feb 21, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation | Feb 21, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Para-Lane: Multi-Lane Dataset Registering Parallel Scans for Benchmarking Novel View Synthesis | Feb 21, 2025 | 3DGS, Autonomous Driving | Unverified | 0 |
| Methods and Trends in Detecting Generated Images: A Comprehensive Review | Feb 21, 2025 | Benchmarking, DeepFake Detection | Unverified | 0 |
| Benchmarking machine learning for bowel sound pattern classification from tabular features to pretrained models | Feb 21, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| PredictaBoard: Benchmarking LLM Score Predictability | Feb 20, 2025 | Benchmarking, Common Sense Reasoning | Code Available | 0 |
| Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems | Feb 20, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks | Feb 20, 2025 | Benchmarking, Combinatorial Optimization | Unverified | 0 |
| Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models | Feb 20, 2025 | Benchmarking | Unverified | 0 |
| Synthetic Porous Microstructures: Automatic Design, Simulation, and Permeability Analysis | Feb 20, 2025 | Benchmarking | Code Available | 0 |
| Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models | Feb 20, 2025 | Benchmarking, Sentence | Unverified | 0 |
| Probabilistic Robustness in Deep Learning: A Concise yet Comprehensive Guide | Feb 20, 2025 | Adversarial Robustness, Benchmarking | Unverified | 0 |
| Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk | Feb 20, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework | Feb 20, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| Reinforcement Learning with Graph Attention for Routing and Wavelength Assignment with Lightpath Reuse | Feb 20, 2025 | Benchmarking, Graph Attention | Unverified | 0 |
| Position: There are no Champions in Long-Term Time Series Forecasting | Feb 19, 2025 | Benchmarking, Position | Unverified | 0 |
| A Baseline Method for Removing Invisible Image Watermarks using Deep Image Prior | Feb 19, 2025 | Benchmarking, Misinformation | Unverified | 0 |
| Benchmarking Self-Supervised Learning Methods for Accelerated MRI Reconstruction | Feb 19, 2025 | Benchmarking, MRI Reconstruction | Code Available | 0 |
| Benchmarking of Different YOLO Models for CAPTCHAs Detection and Classification | Feb 19, 2025 | Benchmarking | Unverified | 0 |
| GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking | Feb 19, 2025 | Benchmarking | Unverified | 0 |
| VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare | Feb 19, 2025 | Benchmarking, Diversity | Unverified | 0 |
| EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking | Feb 18, 2025 | Benchmarking, Binary Classification | Unverified | 0 |
| Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics | Feb 18, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Text2World: Benchmarking Large Language Models for Symbolic World Model Generation | Feb 18, 2025 | Benchmarking | Unverified | 0 |
| A new pathway to generative artificial intelligence by minimizing the maximum entropy | Feb 18, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis | Feb 18, 2025 | Benchmarking, Mamba | Code Available | 0 |
| Multilingual European Language Models: Benchmarking Approaches and Challenges | Feb 18, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models | Feb 18, 2025 | Benchmarking, Large Language Model | Unverified | 0 |
| Benchmarking MedMNIST dataset on real quantum hardware | Feb 18, 2025 | Benchmarking, image-classification | Unverified | 0 |
| LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation | Feb 18, 2025 | Benchmarking, Text Generation | Unverified | 0 |
| Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption | Feb 17, 2025 | Benchmarking, Code Summarization | Unverified | 0 |
| Ansatz-free Hamiltonian learning with Heisenberg-limited scaling | Feb 17, 2025 | Benchmarking | Unverified | 0 |
| Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models | Feb 17, 2025 | Benchmarking | Unverified | 0 |
| Knowledge-aware contrastive heterogeneous molecular graph learning | Feb 17, 2025 | Benchmarking, Contrastive Learning | Unverified | 0 |
| Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance | Feb 17, 2025 | Benchmarking, Dependency Parsing | Unverified | 0 |
| Integrating Expert Knowledge into Logical Programs via LLMs | Feb 17, 2025 | Benchmarking, Logical Reasoning | Code Available | 0 |
| Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment | Feb 17, 2025 | Benchmarking, Common Sense Reasoning | Unverified | 0 |
| Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics | Feb 17, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| JExplore: Design Space Exploration Tool for Nvidia Jetson Boards | Feb 16, 2025 | Benchmarking, GPU | Code Available | 0 |
| Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs | Feb 16, 2025 | Benchmarking | Unverified | 0 |
| TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking | Feb 16, 2025 | Benchmarking | Unverified | 0 |
| User Profile with Large Language Models: Construction, Updating, and Benchmarking | Feb 15, 2025 | Benchmarking, Profile Generation | Unverified | 0 |
| Yesil o1 Pro: Evidence-Based AI Model for Health and Benchmarking in Clinical Decision Support | Feb 15, 2025 | Benchmarking, Epidemiology | Unverified | 0 |
| LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing | Feb 14, 2025 | Benchmarking, RAG | Code Available | 0 |
| MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning? | Feb 14, 2025 | Benchmarking, In-Context Learning | Unverified | 0 |
| Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow | Feb 14, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking the rationality of AI decision making using the transitivity axiom | Feb 14, 2025 | Benchmarking, Decision Making | Unverified | 0 |