| Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design | Apr 14, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| LEMUR Neural Network Dataset: Towards Seamless AutoML | Apr 14, 2025 | AutoML, Benchmarking | Code Available | 1 |
| NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding | Apr 12, 2025 | Benchmarking, Document AI | Unverified | 0 |
| TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning | Apr 11, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs | Apr 11, 2025 | Benchmarking, Image Generation | Code Available | 1 |
| SortBench: Benchmarking LLMs based on their ability to sort lists | Apr 11, 2025 | Benchmarking | Unverified | 0 |
| TorchFX: A modern approach to Audio DSP with PyTorch and GPU acceleration | Apr 11, 2025 | Audio Signal Processing, Benchmarking | Code Available | 2 |
| Adaptive Shrinkage Estimation For Personalized Deep Kernel Regression In Modeling Brain Trajectories | Apr 10, 2025 | Additive models, Benchmarking | Code Available | 0 |
| Benchmarking Suite for Synthetic Aperture Radar Imagery Anomaly Detection (SARIAD) Algorithms | Apr 10, 2025 | Anomaly Detection, Benchmarking | Code Available | 0 |
| NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark | Apr 10, 2025 | Benchmarking | Code Available | 0 |
| SydneyScapes: Image Segmentation for Australian Environments | Apr 10, 2025 | Autonomous Vehicles, Benchmarking | Unverified | 0 |
| Geological Inference from Textual Data using Word Embeddings | Apr 10, 2025 | Benchmarking, Word Embeddings | Code Available | 0 |
| Benchmarking Multi-Organ Segmentation Tools for Multi-Parametric T1-weighted Abdominal MRI | Apr 10, 2025 | Benchmarking, Organ Segmentation | Unverified | 0 |
| Benchmarking Image Embeddings for E-Commerce: Evaluating Off-the Shelf Foundation Models, Fine-Tuning Strategies and Practical Trade-offs | Apr 10, 2025 | Benchmarking, Contrastive Learning | Unverified | 0 |
| Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge | Apr 10, 2025 | Adversarial Robustness, Benchmarking | Code Available | 0 |
| Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program | Apr 9, 2025 | Benchmarking | Code Available | 0 |
| TabKAN: Advancing Tabular Data Analysis using Kolmogorov-Arnold Network | Apr 9, 2025 | Benchmarking, Deep Learning | Unverified | 0 |
| Evolutionary Generation of Random Surreal Numbers for Benchmarking | Apr 9, 2025 | Benchmarking | Code Available | 1 |
| A Roadmap for Improving Data Reliability and Sharing in Crosslinking Mass Spectrometry | Apr 9, 2025 | Benchmarking | Unverified | 0 |
| RayFronts: Open-Set Semantic Ray Frontiers for Online Scene Understanding and Exploration | Apr 9, 2025 | 3D Semantic Segmentation, Benchmarking | Unverified | 0 |
| Can Carbon-Aware Electric Load Shifting Reduce Emissions? An Equilibrium-Based Analysis | Apr 9, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Convolutional Neural Network and Graph Neural Network based Surrogate Models on a Real-World Car External Aerodynamics Dataset | Apr 9, 2025 | Benchmarking, Graph Neural Network | Unverified | 0 |
| V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models | Apr 8, 2025 | Benchmarking, Visual Reasoning | Code Available | 1 |
| An Empirical Study of GPT-4o Image Generation Capabilities | Apr 8, 2025 | Benchmarking, Image Generation | Code Available | 1 |
| Towards Visual Text Grounding of Multimodal Large Language Model | Apr 7, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models | Apr 7, 2025 | Benchmarking | Code Available | 0 |
| Leveraging State Space Models in Long Range Genomics | Apr 7, 2025 | Benchmarking, GPU | Unverified | 0 |
| Generative Adversarial Networks with Limited Data: A Survey and Benchmarking | Apr 7, 2025 | Benchmarking, Image Generation | Unverified | 0 |
| Riemannian Geometry for the classification of brain states with intracortical brain-computer interfaces | Apr 7, 2025 | Benchmarking, Brain Computer Interface | Unverified | 0 |
| Cross-functional transferability in universal machine learning interatomic potentials | Apr 7, 2025 | Benchmarking, Transfer Learning | Unverified | 0 |
| A Solid-State Nanopore Signal Generator for Training Machine Learning Models | Apr 7, 2025 | Benchmarking, Event Detection | Unverified | 0 |
| Prism: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search | Apr 7, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Subjective Visual Quality Assessment for High-Fidelity Learning-Based Image Compression | Apr 7, 2025 | Benchmarking, Image Compression | Code Available | 0 |
| Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs | Apr 7, 2025 | Benchmarking, Fairness | Code Available | 0 |
| CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Apr 6, 2025 | Benchmarking, Combinatorial Optimization | Code Available | 1 |
| A Survey of Pathology Foundation Model: Progress and Future Directions | Apr 5, 2025 | Benchmarking, Multiple Instance Learning | Code Available | 1 |
| Do LLM Evaluators Prefer Themselves for a Reason? | Apr 4, 2025 | Benchmarking, Code Generation | Code Available | 0 |
| Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | Apr 4, 2025 | Benchmarking, Management | Unverified | 0 |
| Point Cloud Objective Quality: Benchmarking Features and Quality Evaluation | Apr 4, 2025 | Attribute, Benchmarking | Unverified | 0 |
| Quantifying Robustness: A Benchmarking Framework for Deep Learning Forecasting in Cyber-Physical Systems | Apr 4, 2025 | Benchmarking, Model Selection | Code Available | 0 |
| Towards a Unified Framework for Determining Conformational Ensembles of Disordered Proteins | Apr 4, 2025 | Benchmarking | Unverified | 0 |
| MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models | Apr 4, 2025 | Benchmarking, Image Generation | Unverified | 0 |
| Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency | Apr 4, 2025 | Benchmarking, GSM8K | Unverified | 0 |
| Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings | Apr 4, 2025 | Benchmarking | Code Available | 0 |
| Evaluating AI Recruitment Sourcing Tools by Human Preference | Apr 3, 2025 | Benchmarking | Code Available | 0 |
| Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Apr 3, 2025 | Benchmarking, Logical Reasoning | Code Available | 2 |
| Generative Evaluation of Complex Reasoning in Large Language Models | Apr 3, 2025 | Benchmarking, Memorization | Code Available | 1 |
| Benchmark of Segmentation Techniques for Pelvic Fracture in CT and X-ray: Summary of the PENGWIN 2024 Challenge | Apr 3, 2025 | Anatomy, Benchmarking | Unverified | 0 |
| Global Rice Multi-Class Segmentation Dataset (RiceSEG): A Comprehensive and Diverse High-Resolution RGB-Annotated Images for the Development and Benchmarking of Rice Segmentation Algorithms | Apr 2, 2025 | Benchmarking, Semantic Segmentation | Unverified | 0 |
| Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers | Apr 2, 2025 | Benchmarking, Management | Unverified | 0 |