| Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings | Jan 14, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| Benchmarking Multimodal Models for Fine-Grained Image Analysis: A Comparative Study Across Diverse Visual Features | Jan 14, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Graph Representations and Graph Neural Networks for Multivariate Time Series Classification | Jan 14, 2025 | Benchmarking, Graph Representation Learning | Code Available | 0 |
| The Paradox of Success in Evolutionary and Bioinspired Optimization: Revisiting Critical Issues, Key Studies, and Methodological Pathways | Jan 13, 2025 | Benchmarking, Metaheuristic Optimization | Unverified | 0 |
| Lessons From Red Teaming 100 Generative AI Products | Jan 13, 2025 | Benchmarking, Red Teaming | Unverified | 0 |
| Stronger Than You Think: Benchmarking Weak Supervision on Realistic Tasks | Jan 13, 2025 | Benchmarking | Code Available | 0 |
| Benchmarking Abstractive Summarisation: A Dataset of Human-authored Summaries of Norwegian News Articles | Jan 13, 2025 | Articles, Benchmarking | Unverified | 0 |
| Understanding and Benchmarking Artificial Intelligence: OpenAI's o3 Is Not AGI | Jan 13, 2025 | ARC, Benchmarking | Unverified | 0 |
| Benchmarking YOLOv8 for Optimal Crack Detection in Civil Infrastructure | Jan 12, 2025 | Benchmarking, Hyperparameter Optimization | Unverified | 0 |
| Evidential Deep Learning for Uncertainty Quantification and Out-of-Distribution Detection in Jet Identification using Deep Neural Networks | Jan 10, 2025 | Anomaly Detection, Benchmarking | Code Available | 0 |
| Benchmarking Rotary Position Embeddings for Automatic Speech Recognition | Jan 10, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Large Physics Models: Towards a collaborative approach with Large Language Models and Foundation Models | Jan 9, 2025 | Benchmarking, Philosophical Reflection | Unverified | 0 |
| Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning | Jan 9, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| CallNavi, A Challenge and Empirical Study on LLM Function Calling and Routing | Jan 9, 2025 | Benchmarking, Chatbot | Unverified | 0 |
| LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation | Jan 9, 2025 | 2k, 8k | Unverified | 0 |
| AgoraSpeech: A multi-annotated comprehensive dataset of political discourse through the lens of humans and AI | Jan 9, 2025 | Benchmarking, named-entity-recognition | Unverified | 0 |
| IOLBENCH: Benchmarking LLMs on Linguistic Reasoning | Jan 8, 2025 | Benchmarking | Code Available | 0 |
| Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization | Jan 8, 2025 | Benchmarking, General Knowledge | Unverified | 0 |
| An Analysis of Model Robustness across Concurrent Distribution Shifts | Jan 8, 2025 | Benchmarking | Unverified | 0 |
| Open-Source Manually Annotated Vocal Tract Database for Automatic Segmentation from 3D MRI Using Deep Learning: Benchmarking 2D and 3D Convolutional and Transformer Networks | Jan 8, 2025 | Benchmarking, Deep Learning | Unverified | 0 |
| Machine Learning for Identifying Grain Boundaries in Scanning Electron Microscopy (SEM) Images of Nanoparticle Superlattices | Jan 7, 2025 | Benchmarking, Clustering | Unverified | 0 |
| Practical Design and Benchmarking of Generative AI Applications for Surgical Billing and Coding | Jan 7, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input | Jan 6, 2025 | Benchmarking, Form | Unverified | 0 |
| MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models | Jan 6, 2025 | Benchmarking, Feature Compression | Unverified | 0 |
| Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence Benchmarks | Jan 5, 2025 | Adversarial Robustness, Benchmarking | Code Available | 0 |
| ANTHROPOS-V: benchmarking the novel task of Crowd Volume Estimation | Jan 3, 2025 | Benchmarking, Crowd Counting | Code Available | 0 |
| QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture | Jan 3, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| AI-Powered Cow Detection in Complex Farm Environments | Jan 3, 2025 | Benchmarking | Unverified | 0 |
| PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of Psychiatric Assessment Conversational Agents | Jan 3, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Constraint-Based Bayesian Structure Learning Algorithms: Role of Network Topology | Jan 2, 2025 | Benchmarking, Sensitivity | Unverified | 0 |
| CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings | Jan 2, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery | Jan 2, 2025 | Benchmarking, Experimental Design | Code Available | 0 |
| TabTreeFormer: Tabular Data Generation Using Hybrid Tree-Transformer | Jan 2, 2025 | Benchmarking, Quantization | Unverified | 0 |
| MSC-Bench: Benchmarking and Analyzing Multi-Sensor Corruption for Driving Perception | Jan 2, 2025 | 3D Object Detection, Autonomous Driving | Unverified | 0 |
| State-of-the-art AI-based Learning Approaches for Deepfake Generation and Detection, Analyzing Opportunities, Threading through Pros, Cons, and Future Prospects | Jan 2, 2025 | Benchmarking, Face Swapping | Unverified | 0 |
| RCP-Bench: Benchmarking Robustness for Collaborative Perception Under Diverse Corruptions | Jan 1, 2025 | Benchmarking | Code Available | 0 |
| CroCoDL: Cross-device Collaborative Dataset for Localization | Jan 1, 2025 | Benchmarking, Pose Estimation | Unverified | 0 |
| Six-CD: Benchmarking Concept Removals for Text-to-image Diffusion Models | Jan 1, 2025 | Benchmarking | Unverified | 0 |
| CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools | Jan 1, 2025 | Benchmarking | Unverified | 0 |
| CheXwhatsApp: A Dataset for Exploring Challenges in the Diagnosis of Chest X-rays through Mobile Devices | Jan 1, 2025 | Benchmarking | Unverified | 0 |
| Segmenting Maxillofacial Structures in CBCT Volumes | Jan 1, 2025 | Anatomy, Benchmarking | Unverified | 0 |
| Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback | Jan 1, 2025 | Benchmarking | Unverified | 0 |
| InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation | Jan 1, 2025 | Benchmarking, Human-Object Interaction Detection | Unverified | 0 |
| SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation | Jan 1, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| On the Utility of Equivariance and Symmetry Breaking in Deep Learning Architectures on Point Clouds | Jan 1, 2025 | Benchmarking | Unverified | 0 |
| Geometry Matters: Benchmarking Scientific ML Approaches for Flow Prediction around Complex Geometries | Dec 31, 2024 | Benchmarking, Out-of-Distribution Generalization | Unverified | 0 |
| A review of faithfulness metrics for hallucination assessment in Large Language Models | Dec 31, 2024 | Benchmarking, Hallucination | Unverified | 0 |
| AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects | Dec 31, 2024 | Benchmarking, Multiple-choice | Unverified | 0 |
| Measuring Large Language Models Capacity to Annotate Journalistic Sourcing | Dec 30, 2024 | Benchmarking, Ethics | Unverified | 0 |
| SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | Dec 30, 2024 | Benchmarking, Code Generation | Unverified | 0 |