| From Perfect to Noisy World Simulation: Customizable Embodied Multi-modal Perturbations for SLAM Robustness Benchmarking | Jun 24, 2024 | Benchmarking, NeRF | Code Available | 2 |
| FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation | Mar 4, 2023 | Benchmarking, GPU | Code Available | 2 |
| LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Oct 13, 2024 | Benchmarking, Graph Generation | Code Available | 2 |
| Assessing SPARQL capabilities of Large Language Models | Sep 9, 2024 | Benchmarking, Knowledge Graphs | Code Available | 2 |
| DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation | Jun 24, 2024 | Benchmarking, Image Generation | Code Available | 2 |
| GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning | Jul 4, 2025 | Benchmarking, Graph Generation | Code Available | 2 |
| State-specific protein-ligand complex structure prediction with a multi-scale deep generative model | Sep 30, 2022 | Benchmarking, Blind Docking | Code Available | 2 |
| AutoPenBench: Benchmarking Generative Agents for Penetration Testing | Oct 4, 2024 | Benchmarking | Code Available | 2 |
| GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis | Jun 21, 2024 | AI Agent, AutoML | Code Available | 2 |
| AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator | Feb 15, 2024 | Benchmarking, Diagnostic | Code Available | 2 |
| BARS: Towards Open Benchmarking for Recommender Systems | May 19, 2022 | Benchmarking, Click-Through Rate Prediction | Code Available | 2 |
| GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond | Sep 28, 2023 | Benchmarking | Code Available | 2 |
| GSCodec Studio: A Modular Framework for Gaussian Splat Compression | Jun 2, 2025 | Benchmarking | Code Available | 2 |
| GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization | Sep 24, 2024 | 3D geometry, 3DGS | Code Available | 2 |
| Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint) | Jan 14, 2023 | Benchmarking | Code Available | 2 |
| MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation | Aug 17, 2022 | Benchmarking, Code Generation | Code Available | 2 |
| AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Feb 12, 2024 | 2k, Automatic Speech Recognition | Code Available | 2 |
| Deep Visual Geo-localization Benchmark | Apr 7, 2022 | Benchmarking, Data Augmentation | Code Available | 2 |
| DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering | Jul 15, 2025 | Benchmarking, Instruction Following | Code Available | 2 |
| EasyTPP: Towards Open Benchmarking Temporal Point Processes | Jul 16, 2023 | Benchmarking, Point Processes | Code Available | 2 |
| IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer | Jul 27, 2023 | Benchmarking, Image Manipulation | Code Available | 2 |
| AiTLAS: Artificial Intelligence Toolbox for Earth Observation | Jan 21, 2022 | Benchmarking, Earth Observation | Code Available | 2 |
| Benchmarking Agentic Workflow Generation | Oct 10, 2024 | Benchmarking | Code Available | 2 |
| InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Mar 5, 2024 | Benchmarking, Language Modeling | Code Available | 2 |
| InstructLayout: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior | Jul 10, 2024 | Benchmarking, Decoder | Code Available | 2 |
| InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior | Feb 7, 2024 | Benchmarking, Decoder | Code Available | 2 |
| Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception | Jun 10, 2023 | 3D Object Detection, Benchmarking | Code Available | 2 |
| A large annotated medical image dataset for the development and evaluation of segmentation algorithms | Feb 25, 2019 | Benchmarking, Segmentation | Code Available | 2 |
| IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments | Jun 11, 2025 | Benchmarking | Code Available | 2 |
| Investigating Tradeoffs in Real-World Video Super-Resolution | Nov 24, 2021 | Benchmarking, Super-Resolution | Code Available | 2 |
| Datasets and Benchmarks for Offline Safe Reinforcement Learning | Jun 15, 2023 | Autonomous Driving, Benchmarking | Code Available | 2 |
| K-LITE: Learning Transferable Visual Models with External Knowledge | Apr 20, 2022 | Benchmarking, Descriptive | Code Available | 2 |
| LaMAR: Benchmarking Localization and Mapping for Augmented Reality | Oct 19, 2022 | Benchmarking, Diversity | Code Available | 2 |
| Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Feb 12, 2024 | Benchmarking, Simultaneous Localization and Mapping | Code Available | 2 |
| CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | May 24, 2025 | Benchmarking | Code Available | 2 |
| Benchmarking and Improving Detail Image Caption | May 29, 2024 | Benchmarking, Image Captioning | Code Available | 2 |
| DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Jun 22, 2022 | Benchmarking, Recommendation Systems | Code Available | 2 |
| Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA | Jun 25, 2024 | Benchmarking, Long-Context Understanding | Code Available | 2 |
| Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Mar 21, 2025 | Benchmarking, Video Generation | Code Available | 2 |
| LLaMEA-BO: A Large Language Model Evolutionary Algorithm for Automatically Generating Bayesian Optimization Algorithms | May 27, 2025 | Bayesian Optimization, Benchmarking | Code Available | 2 |
| CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Oct 30, 2024 | Benchmarking, Passage Retrieval | Code Available | 2 |
| Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Jul 4, 2024 | Benchmarking, Minecraft | Code Available | 2 |
| CoqPilot, a plugin for LLM-based generation of proofs | Oct 25, 2024 | Benchmarking | Code Available | 2 |
| COALA: A Practical and Vision-Centric Federated Learning Platform | Jul 23, 2024 | Benchmarking, Continual Learning | Code Available | 2 |
| LtU-ILI: An All-in-One Framework for Implicit Inference in Astrophysics and Cosmology | Feb 6, 2024 | All, Benchmarking | Code Available | 2 |
| LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K | Feb 6, 2024 | 16k, Benchmarking | Code Available | 2 |
| Class-incremental Learning for Time Series: Benchmark and Evaluation | Feb 19, 2024 | Activity Recognition, Benchmarking | Code Available | 2 |
| MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data | Jun 26, 2024 | Benchmarking, Math | Code Available | 2 |
| MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning | Mar 10, 2025 | Benchmarking, Medical Question Answering | Code Available | 2 |
| ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling | Jul 4, 2023 | Benchmarking, Weather Forecasting | Code Available | 2 |