| OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation | Apr 18, 2025 | Benchmarking | Unverified | 0 |
| THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models | Apr 17, 2025 | Benchmarking, Math | Unverified | 0 |
| Featuremetric benchmarking: Quantum computer benchmarks based on circuit features | Apr 17, 2025 | Benchmarking | Unverified | 0 |
| ALT: A Python Package for Lightweight Feature Representation in Time Series Classification | Apr 17, 2025 | Benchmarking, Time Series | Unverified | 0 |
| Local Data Quantity-Aware Weighted Averaging for Federated Learning with Dishonest Clients | Apr 17, 2025 | Benchmarking, Federated Learning | Unverified | 0 |
| Benchmarking Multi-National Value Alignment for Large Language Models | Apr 17, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking LLM-based Relevance Judgment Methods | Apr 17, 2025 | Benchmarking, Open-Domain Question Answering | Code Available | 0 |
| Enhancing Explainability and Reliable Decision-Making in Particle Swarm Optimization through Communication Topologies | Apr 17, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Continual Learning Strategies for 3D Engineering Regression Problems: A Benchmarking Study | Apr 16, 2025 | Benchmarking, Continual Learning | Code Available | 0 |
| Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions | Apr 16, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Benchmarking Audio Deepfake Detection Robustness in Real-world Communication Scenarios | Apr 16, 2025 | Audio Deepfake Detection, Benchmarking | Unverified | 0 |
| Causality-enhanced Decision-Making for Autonomous Mobile Robots in Dynamic Environments | Apr 16, 2025 | Benchmarking, Causal Inference | Code Available | 0 |
| Power Line Communication vs. Talkative Power Conversion: A Benchmarking Study | Apr 16, 2025 | Benchmarking | Unverified | 0 |
| pix2pockets: Shot Suggestions in 8-Ball Pool from a Single Image in the Wild | Apr 16, 2025 | Benchmarking, object-detection | Unverified | 0 |
| Benchmarking Mutual Information-based Loss Functions in Federated Learning | Apr 16, 2025 | Benchmarking, Fairness | Unverified | 0 |
| E2E Parking Dataset: An Open Benchmark for End-to-End Autonomous Parking | Apr 15, 2025 | Benchmarking, Position | Unverified | 0 |
| Benchmarking Biopharmaceuticals Retrieval-Augmented Generation Evaluation | Apr 15, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives | Apr 15, 2025 | Benchmarking | Unverified | 0 |
| GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR | Apr 15, 2025 | Benchmarking | Unverified | 0 |
| BEACON: A Benchmark for Efficient and Accurate Counting of Subgraphs | Apr 15, 2025 | Benchmarking, Subgraph Counting | Unverified | 0 |
| Mamba-Based Ensemble learning for White Blood Cell Classification | Apr 15, 2025 | Benchmarking, Classification | Code Available | 0 |
| Benchmarking Vision Language Models on German Factual Data | Apr 15, 2025 | Benchmarking | Unverified | 0 |
| FHBench: Towards Efficient and Personalized Federated Learning for Multimodal Healthcare | Apr 15, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items | Apr 15, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| Trade-offs in Privacy-Preserving Eye Tracking through Iris Obfuscation: A Benchmarking Study | Apr 14, 2025 | Benchmarking, Gaze Estimation | Code Available | 0 |
| COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts | Apr 14, 2025 | Benchmarking, Object | Unverified | 0 |
| Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization | Apr 14, 2025 | Benchmarking, Earth Observation | Unverified | 0 |
| LMFormer: Lane based Motion Prediction Transformer | Apr 14, 2025 | Autonomous Driving, Benchmarking | Unverified | 0 |
| CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography | Apr 14, 2025 | Benchmarking, Visual Reasoning | Unverified | 0 |
| BoTTA: Benchmarking on-device Test Time Adaptation | Apr 14, 2025 | Benchmarking, Test-time Adaptation | Unverified | 0 |
| Benchmarking 3D Human Pose Estimation Models Under Occlusions | Apr 14, 2025 | 3D Human Pose Estimation, Benchmarking | Unverified | 0 |
| Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models | Apr 14, 2025 | Benchmarking, Descriptive | Unverified | 0 |
| Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design | Apr 14, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding | Apr 12, 2025 | Benchmarking, Document AI | Unverified | 0 |
| SortBench: Benchmarking LLMs based on their ability to sort lists | Apr 11, 2025 | Benchmarking | Unverified | 0 |
| TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning | Apr 11, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Geological Inference from Textual Data using Word Embeddings | Apr 10, 2025 | Benchmarking, Word Embeddings | Code Available | 0 |
| Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge | Apr 10, 2025 | Adversarial Robustness, Benchmarking | Code Available | 0 |
| Adaptive Shrinkage Estimation For Personalized Deep Kernel Regression In Modeling Brain Trajectories | Apr 10, 2025 | Additive models, Benchmarking | Code Available | 0 |
| NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark | Apr 10, 2025 | Benchmarking | Code Available | 0 |
| Benchmarking Suite for Synthetic Aperture Radar Imagery Anomaly Detection (SARIAD) Algorithms | Apr 10, 2025 | Anomaly Detection, Benchmarking | Code Available | 0 |
| SydneyScapes: Image Segmentation for Australian Environments | Apr 10, 2025 | Autonomous Vehicles, Benchmarking | Unverified | 0 |
| Benchmarking Multi-Organ Segmentation Tools for Multi-Parametric T1-weighted Abdominal MRI | Apr 10, 2025 | Benchmarking, Organ Segmentation | Unverified | 0 |
| Benchmarking Image Embeddings for E-Commerce: Evaluating Off-the Shelf Foundation Models, Fine-Tuning Strategies and Practical Trade-offs | Apr 10, 2025 | Benchmarking, Contrastive Learning | Unverified | 0 |
| Benchmarking Convolutional Neural Network and Graph Neural Network based Surrogate Models on a Real-World Car External Aerodynamics Dataset | Apr 9, 2025 | Benchmarking, Graph Neural Network | Unverified | 0 |
| Can Carbon-Aware Electric Load Shifting Reduce Emissions? An Equilibrium-Based Analysis | Apr 9, 2025 | Benchmarking | Unverified | 0 |
| TabKAN: Advancing Tabular Data Analysis using Kolmogorov-Arnold Network | Apr 9, 2025 | Benchmarking, Deep Learning | Unverified | 0 |
| RayFronts: Open-Set Semantic Ray Frontiers for Online Scene Understanding and Exploration | Apr 9, 2025 | 3D Semantic Segmentation, Benchmarking | Unverified | 0 |
| Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program | Apr 9, 2025 | Benchmarking | Code Available | 0 |
| A Roadmap for Improving Data Reliability and Sharing in Crosslinking Mass Spectrometry | Apr 9, 2025 | Benchmarking | Unverified | 0 |