| Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers | Apr 2, 2025 | Benchmarking, Management | Unverified | 0 |
| When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks | Apr 2, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions | Apr 2, 2025 | Benchmarking, Segmentation | Unverified | 0 |
| BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing | Apr 2, 2025 | 3D Reconstruction, Benchmarking | Code Available | 1 |
| FIORD: A Fisheye Indoor-Outdoor Dataset with LIDAR Ground Truth for 3D Scene Reconstruction and Benchmarking | Apr 2, 2025 | 3D Scene Reconstruction, Benchmarking | Unverified | 0 |
| Horizon Scans can be accelerated using novel information retrieval and artificial intelligence tools | Apr 2, 2025 | Active Learning, Articles | Unverified | 0 |
| Accelerating IoV Intrusion Detection: Benchmarking GPU-Accelerated vs CPU-Based ML Libraries | Apr 2, 2025 | Benchmarking, Computational Efficiency | Unverified | 0 |
| Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework | Apr 2, 2025 | Benchmarking, Synthetic Data Generation | Code Available | 2 |
| Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models | Apr 1, 2025 | Benchmarking | Unverified | 0 |
| TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images | Apr 1, 2025 | Autonomous Navigation, Benchmarking | Code Available | 0 |
| Scaling Up Resonate-and-Fire Networks for Fast Deep Learning | Apr 1, 2025 | Benchmarking, Deep Learning | Code Available | 0 |
| Benchmarking Federated Machine Unlearning methods for Tabular Data | Apr 1, 2025 | Benchmarking, Computational Efficiency | Unverified | 0 |
| Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models | Apr 1, 2025 | Benchmarking, Conversational Question Answering | Unverified | 0 |
| Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs' Metacognitive Cultural Intelligence with CQ-Bench | Apr 1, 2025 | Benchmarking | Code Available | 0 |
| LOCO-EPI: Leave-one-chromosome-out (LOCO) as a benchmarking paradigm for deep learning based prediction of enhancer-promoter interactions | Apr 1, 2025 | Benchmarking | Code Available | 0 |
| On Benchmarking Code LLMs for Android Malware Analysis | Apr 1, 2025 | Benchmarking, Malware Analysis | Unverified | 0 |
| SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers | Mar 31, 2025 | Benchmarking | Code Available | 1 |
| Uni-Render: A Unified Accelerator for Real-Time Rendering Across Diverse Neural Renderers | Mar 31, 2025 | Benchmarking, Neural Rendering | Unverified | 0 |
| Towards Benchmarking and Assessing the Safety and Robustness of Autonomous Driving on Safety-critical Scenarios | Mar 31, 2025 | Adversarial Attack, Autonomous Driving | Unverified | 0 |
| Simple Feedfoward Neural Networks are Almost All You Need for Time Series Forecasting | Mar 30, 2025 | All, Benchmarking | Unverified | 0 |
| Benchmarking Systematic Relational Reasoning with Large Language and Reasoning Models | Mar 30, 2025 | Benchmarking, Relational Reasoning | Unverified | 0 |
| MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation | Mar 29, 2025 | Answer Generation, Benchmarking | Unverified | 0 |
| Unsupervised Anomaly Detection in Multivariate Time Series across Heterogeneous Domains | Mar 29, 2025 | Anomaly Detection, Benchmarking | Code Available | 0 |
| RL2Grid: Benchmarking Reinforcement Learning in Power Grid Operations | Mar 29, 2025 | Benchmarking, reinforcement-learning | Unverified | 0 |
| CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis | Mar 29, 2025 | Benchmarking, Large Language Model | Unverified | 0 |
| SimBank: from Simulation to Solution in Prescriptive Process Monitoring | Mar 28, 2025 | Benchmarking | Unverified | 0 |
| Generalization Bias in Large Language Model Summarization of Scientific Research | Mar 28, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos | Mar 28, 2025 | Benchmarking, Question Answering | Code Available | 1 |
| Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors | Mar 28, 2025 | Benchmarking, Code Generation | Code Available | 0 |
| Benchmarking Ultra-Low-Power μNPUs | Mar 28, 2025 | Benchmarking | Unverified | 0 |
| An Advanced Ensemble Deep Learning Framework for Stock Price Prediction Using VAE, Transformer, and LSTM Model | Mar 28, 2025 | Algorithmic Trading, Benchmarking | Unverified | 0 |
| LIM: Large Interpolator Model for Dynamic Reconstruction | Mar 28, 2025 | 4D reconstruction, Benchmarking | Unverified | 0 |
| Assessing Foundation Models for Sea Ice Type Segmentation in Sentinel-1 SAR Imagery | Mar 28, 2025 | Benchmarking, Segmentation | Unverified | 0 |
| Benchmarking Deep Learning-Based Methods for Irradiance Nowcasting with Sky Images | Mar 27, 2025 | Benchmarking | Unverified | 0 |
| CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers? | Mar 27, 2025 | Benchmarking, Specificity | Code Available | 0 |
| Evaluating Text-to-Image Synthesis with a Conditional Fréchet Distance | Mar 27, 2025 | Benchmarking, Image Generation | Unverified | 0 |
| ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition | Mar 27, 2025 | Benchmarking, scientific discovery | Unverified | 0 |
| GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics | Mar 27, 2025 | Benchmarking, Natural Language Queries | Unverified | 0 |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Mar 27, 2025 | Attribute, Benchmarking | Code Available | 1 |
| A Comprehensive Benchmark for RNA 3D Structure-Function Modeling | Mar 27, 2025 | Benchmarking, Deep Learning | Code Available | 1 |
| CSPO: Cross-Market Synergistic Stock Price Movement Forecasting with Pseudo-volatility Optimization | Mar 26, 2025 | Benchmarking | Unverified | 0 |
| Can geometric combinatorics improve RNA branching predictions? | Mar 26, 2025 | Benchmarking | Code Available | 0 |
| RxRx3-core: Benchmarking drug-target interactions in High-Content Microscopy | Mar 26, 2025 | Benchmarking, Representation Learning | Unverified | 0 |
| StableToolBench-MirrorAPI: Modeling Tool Environments as Mirrors of 7,000+ Real-World APIs | Mar 26, 2025 | Benchmarking | Code Available | 3 |
| Benchmarking and optimizing organism wide single-cell RNA alignment methods | Mar 26, 2025 | Benchmarking, Decoder | Code Available | 0 |
| TerraTorch: The Geospatial Foundation Models Toolkit | Mar 26, 2025 | Benchmarking, Decoder | Code Available | 4 |
| Benchmarking Machine Learning Methods for Distributed Acoustic Sensing | Mar 26, 2025 | Benchmarking, Data Augmentation | Unverified | 0 |
| Reservoir Computing with a Single Oscillating Gas Bubble: Emphasizing the Chaotic Regime | Mar 25, 2025 | Benchmarking, Learning Theory | Unverified | 0 |
| Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy | Mar 25, 2025 | Benchmarking, speech-recognition | Unverified | 0 |
| Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Mar 25, 2025 | Benchmarking, Image Captioning | Code Available | 1 |