| The Paradox of Success in Evolutionary and Bioinspired Optimization: Revisiting Critical Issues, Key Studies, and Methodological Pathways | Jan 13, 2025 | Benchmarking, Metaheuristic Optimization | Unverified | 0 |
| TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations | Jan 13, 2025 | Benchmarking, Domain Adaptation | Code Available | 1 |
| WebWalker: Benchmarking LLMs in Web Traversal | Jan 13, 2025 | Benchmarking, Open-Domain Question Answering | Code Available | 11 |
| Lessons From Red Teaming 100 Generative AI Products | Jan 13, 2025 | Benchmarking, Red Teaming | Unverified | 0 |
| ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian | Jan 12, 2025 | Benchmarking, Math | Code Available | 1 |
| Benchmarking YOLOv8 for Optimal Crack Detection in Civil Infrastructure | Jan 12, 2025 | Benchmarking, Hyperparameter Optimization | Unverified | 0 |
| Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis | Jan 11, 2025 | Attribute, Benchmarking | Code Available | 1 |
| Evidential Deep Learning for Uncertainty Quantification and Out-of-Distribution Detection in Jet Identification using Deep Neural Networks | Jan 10, 2025 | Anomaly Detection, Benchmarking | Code Available | 0 |
| Benchmarking Rotary Position Embeddings for Automatic Speech Recognition | Jan 10, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports and Patient-Specific Information | Jan 10, 2025 | Benchmarking, Data Augmentation | Code Available | 1 |
| AgoraSpeech: A multi-annotated comprehensive dataset of political discourse through the lens of humans and AI | Jan 9, 2025 | Benchmarking, named-entity-recognition | Unverified | 0 |
| OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | Jan 9, 2025 | Benchmarking, Video Understanding | Code Available | 2 |
| Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning | Jan 9, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| CallNavi, A Challenge and Empirical Study on LLM Function Calling and Routing | Jan 9, 2025 | Benchmarking, Chatbot | Unverified | 0 |
| VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models | Jan 9, 2025 | Benchmarking, Mathematical Problem-Solving | Code Available | 1 |
| Large Physics Models: Towards a collaborative approach with Large Language Models and Foundation Models | Jan 9, 2025 | Benchmarking, Philosophical Reflection | Unverified | 0 |
| LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation | Jan 9, 2025 | 2k, 8k | Unverified | 0 |
| Open-Source Manually Annotated Vocal Tract Database for Automatic Segmentation from 3D MRI Using Deep Learning: Benchmarking 2D and 3D Convolutional and Transformer Networks | Jan 8, 2025 | Benchmarking, Deep Learning | Unverified | 0 |
| Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization | Jan 8, 2025 | Benchmarking, General Knowledge | Unverified | 0 |
| IOLBENCH: Benchmarking LLMs on Linguistic Reasoning | Jan 8, 2025 | Benchmarking | Code Available | 0 |
| An Analysis of Model Robustness across Concurrent Distribution Shifts | Jan 8, 2025 | Benchmarking | Unverified | 0 |
| Practical Design and Benchmarking of Generative AI Applications for Surgical Billing and Coding | Jan 7, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Machine Learning for Identifying Grain Boundaries in Scanning Electron Microscopy (SEM) Images of Nanoparticle Superlattices | Jan 7, 2025 | Benchmarking, Clustering | Unverified | 0 |
| The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input | Jan 6, 2025 | Benchmarking, Form | Unverified | 0 |
| Underwater Image Restoration Through a Prior Guided Hybrid Sense Approach and Extensive Benchmark Analysis | Jan 6, 2025 | Benchmarking, Image Enhancement | Code Available | 1 |