| Underwater Image Restoration Through a Prior Guided Hybrid Sense Approach and Extensive Benchmark Analysis | Jan 6, 2025 | Benchmarking, Image Enhancement | Code Available | 1 |
| Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence Benchmarks | Jan 5, 2025 | Adversarial Robustness, Benchmarking | Code Available | 0 |
| ANTHROPOS-V: benchmarking the novel task of Crowd Volume Estimation | Jan 3, 2025 | Benchmarking, Crowd Counting | Code Available | 0 |
| AI-Powered Cow Detection in Complex Farm Environments | Jan 3, 2025 | Benchmarking | Unverified | 0 |
| QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture | Jan 3, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of Psychiatric Assessment Conversational Agents | Jan 3, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Constraint-Based Bayesian Structure Learning Algorithms: Role of Network Topology | Jan 2, 2025 | Benchmarking, Sensitivity | Unverified | 0 |
| BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery | Jan 2, 2025 | Benchmarking, Experimental Design | Code Available | 0 |
| CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Jan 2, 2025 | Benchmarking, Computer Security | Code Available | 1 |
| CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings | Jan 2, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| MSC-Bench: Benchmarking and Analyzing Multi-Sensor Corruption for Driving Perception | Jan 2, 2025 | 3D Object Detection, Autonomous Driving | Unverified | 0 |
| TabTreeFormer: Tabular Data Generation Using Hybrid Tree-Transformer | Jan 2, 2025 | Benchmarking, Quantization | Unverified | 0 |
| State-of-the-art AI-based Learning Approaches for Deepfake Generation and Detection, Analyzing Opportunities, Threading through Pros, Cons, and Future Prospects | Jan 2, 2025 | Benchmarking, Face Swapping | Unverified | 0 |
| CheXwhatsApp: A Dataset for Exploring Challenges in the Diagnosis of Chest X-rays through Mobile Devices | Jan 1, 2025 | Benchmarking | Unverified | 0 |
| Six-CD: Benchmarking Concept Removals for Text-to-image Diffusion Models | Jan 1, 2025 | Benchmarking | Unverified | 0 |
| InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation | Jan 1, 2025 | Benchmarking, Human-Object Interaction Detection | Unverified | 0 |
| Segmenting Maxillofacial Structures in CBCT Volumes | Jan 1, 2025 | Anatomy, Benchmarking | Unverified | 0 |
| SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation | Jan 1, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback | Jan 1, 2025 | Benchmarking | Unverified | 0 |
| CroCoDL: Cross-device Collaborative Dataset for Localization | Jan 1, 2025 | Benchmarking, Pose Estimation | Unverified | 0 |
| RCP-Bench: Benchmarking Robustness for Collaborative Perception Under Diverse Corruptions | Jan 1, 2025 | Benchmarking | Code Available | 0 |
| CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools | Jan 1, 2025 | Benchmarking | Unverified | 0 |
| nnWNet: Rethinking the Use of Transformers in Biomedical Image Segmentation and Calling for a Unified Evaluation Benchmark | Jan 1, 2025 | Benchmarking, Image Segmentation | Code Available | 2 |
| On the Utility of Equivariance and Symmetry Breaking in Deep Learning Architectures on Point Clouds | Jan 1, 2025 | Benchmarking | Unverified | 0 |
| Geometry Matters: Benchmarking Scientific ML Approaches for Flow Prediction around Complex Geometries | Dec 31, 2024 | Benchmarking, Out-of-Distribution Generalization | Unverified | 0 |
| OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | Dec 31, 2024 | Benchmarking, Logical Reasoning | Code Available | 4 |
| A review of faithfulness metrics for hallucination assessment in Large Language Models | Dec 31, 2024 | Benchmarking, Hallucination | Unverified | 0 |
| AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects | Dec 31, 2024 | Benchmarking, Multiple-choice | Unverified | 0 |
| Measuring Large Language Models Capacity to Annotate Journalistic Sourcing | Dec 30, 2024 | Benchmarking, Ethics | Unverified | 0 |
| TrajLearn: Trajectory Prediction Learning using Deep Generative Models | Dec 30, 2024 | Autonomous Navigation, Benchmarking | Code Available | 1 |
| UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI | Dec 30, 2024 | Benchmarking, Reinforcement Learning (RL) | Unverified | 0 |
| SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | Dec 30, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| Stratify: Unifying Multi-Step Forecasting Strategies | Dec 29, 2024 | Benchmarking | Unverified | 0 |
| On dataset transferability in medical image classification | Dec 28, 2024 | Benchmarking, Classification | Code Available | 0 |
| Towards Ideal Temporal Graph Neural Networks: Evaluations and Conclusions after 10,000 GPU Hours | Dec 28, 2024 | Benchmarking, GPU | Unverified | 0 |
| Machine Generated Product Advertisements: Benchmarking LLMs Against Human Performance | Dec 27, 2024 | Benchmarking, Persuasiveness | Unverified | 0 |
| How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study | Dec 25, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| The Jungle of Generative Drug Discovery: Traps, Treasures, and Ways Out | Dec 24, 2024 | Benchmarking, Deep Learning | Unverified | 0 |
| Re-assessing ImageNet: How aligned is its single-label assumption with its multi-label nature? | Dec 24, 2024 | Benchmarking | Unverified | 0 |
| MixMAS: A Framework for Sampling-Based Mixer Architecture Search for Multimodal Fusion and Learning | Dec 24, 2024 | Benchmarking | Code Available | 0 |
| A Deep Reinforcement Learning Framework for Dynamic Portfolio Optimization: Evidence from China's Stock Market | Dec 24, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| Factuality or Fiction? Benchmarking Modern LLMs on Ambiguous QA with Citations | Dec 23, 2024 | Benchmarking, Question Answering | Unverified | 0 |
| StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs | Dec 23, 2024 | Benchmarking, Logical Reasoning | Unverified | 0 |
| Benchmarking Generative AI Models for Deep Learning Test Input Generation | Dec 23, 2024 | Benchmarking, Deep Learning | Code Available | 0 |
| Multimodal Deep Reinforcement Learning for Portfolio Optimization | Dec 23, 2024 | Articles, Benchmarking | Unverified | 0 |
| SCBench: A Sports Commentary Benchmark for Video LLMs | Dec 23, 2024 | Benchmarking | Unverified | 0 |
| SMAC-Hard: Enabling Mixed Opponent Strategy Script and Self-play on SMAC | Dec 23, 2024 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 1 |
| On the Generalization Ability of Machine-Generated Text Detectors | Dec 23, 2024 | Benchmarking, Misinformation | Code Available | 1 |
| Chumor 2.0: Towards Benchmarking Chinese Humor Understanding | Dec 23, 2024 | Benchmarking | Code Available | 0 |
| Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders | Dec 23, 2024 | 3D Shape Modeling, Benchmarking | Code Available | 4 |