| ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Jun 21, 2025 | BenchmarkingCPU | CodeCode Available | 1 |
| InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | Jun 19, 2025 | BenchmarkingDescriptive | CodeCode Available | 1 |
| GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies | Jun 17, 2025 | Benchmarking | CodeCode Available | 1 |
| The Price of Freedom: Exploring Expressivity and Runtime Tradeoffs in Equivariant Tensor Products | Jun 16, 2025 | Benchmarking | CodeCode Available | 1 |
| ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Jun 15, 2025 | Benchmarking | CodeCode Available | 1 |
| GLGENN: A Novel Parameter-Light Equivariant Neural Networks Architecture Based on Clifford Geometric Algebras | Jun 11, 2025 | Benchmarking | CodeCode Available | 1 |
| Attention, Please! Revisiting Attentive Probing for Masked Image Modeling | Jun 11, 2025 | BenchmarkingComputational Efficiency | CodeCode Available | 1 |
| CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Jun 10, 2025 | Benchmarking | CodeCode Available | 1 |
| scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data | Jun 10, 2025 | BenchmarkingData Augmentation | CodeCode Available | 1 |
| RADAR: Benchmarking Language Models on Imperfect Tabular Data | Jun 9, 2025 | BenchmarkingMissing Values | CodeCode Available | 1 |
| FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging | Jun 6, 2025 | Benchmarking | CodeCode Available | 1 |
| MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark | Jun 5, 2025 | Benchmarking | CodeCode Available | 1 |
| macOSWorld: A Multilingual Interactive Benchmark for GUI Agents | Jun 4, 2025 | BenchmarkingDomain Adaptation | CodeCode Available | 1 |
| Rethinking Machine Unlearning in Image Generation Models | Jun 3, 2025 | BenchmarkingImage Generation | CodeCode Available | 1 |
| NetPress: Dynamically Generated LLM Benchmarks for Network Applications | Jun 3, 2025 | Benchmarking | CodeCode Available | 1 |
| ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions | Jun 3, 2025 | BenchmarkingDiversity | CodeCode Available | 1 |
| CODEMENV: Benchmarking Large Language Models on Code Migration | Jun 1, 2025 | Benchmarking | CodeCode Available | 1 |
| AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time | May 31, 2025 | BenchmarkingTest-time Adaptation | CodeCode Available | 1 |
| Bench4KE: Benchmarking Automated Competency Question Generation | May 30, 2025 | BenchmarkingQuestion Generation | CodeCode Available | 1 |
| Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation | May 30, 2025 | AllBenchmarking | CodeCode Available | 1 |
| ByzFL: Research Framework for Robust Federated Learning | May 30, 2025 | BenchmarkingFederated Learning | CodeCode Available | 1 |
| Toward Memory-Aided World Models: Benchmarking via Spatial Consistency | May 29, 2025 | BenchmarkingMinecraft | CodeCode Available | 1 |
| Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking | May 28, 2025 | Benchmarking | CodeCode Available | 1 |
| SVRPBench: A Realistic Benchmark for Stochastic Vehicle Routing Problem | May 28, 2025 | Benchmarking | CodeCode Available | 1 |
| GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking | May 28, 2025 | BenchmarkingText Spotting | CodeCode Available | 1 |
| RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments | May 28, 2025 | BenchmarkingRed Teaming | CodeCode Available | 1 |
| Bencher: Simple and Reproducible Benchmarking for Black-Box Optimization | May 27, 2025 | Benchmarking | CodeCode Available | 1 |
| FM-Planner: Foundation Model Guided Path Planning for Autonomous Drone Navigation | May 27, 2025 | BenchmarkingDecision Making | CodeCode Available | 1 |
| Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models | May 26, 2025 | BenchmarkingRAG | CodeCode Available | 1 |
| OB3D: A New Dataset for Benchmarking Omnidirectional 3D Reconstruction Using Blender | May 26, 2025 | 3DGS3D Reconstruction | CodeCode Available | 1 |
| MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | May 26, 2025 | BenchmarkingMinecraft | CodeCode Available | 1 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | May 25, 2025 | AnatomyBenchmarking | CodeCode Available | 1 |
| SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning | May 25, 2025 | BenchmarkingVisual Reasoning | CodeCode Available | 1 |
| FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow | May 23, 2025 | BenchmarkingCode Generation | CodeCode Available | 1 |
| Semantic Correspondence: Unified Benchmarking and a Strong Baseline | May 23, 2025 | BenchmarkingSemantic correspondence | CodeCode Available | 1 |
| Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions | May 23, 2025 | 2kBenchmarking | CodeCode Available | 1 |
| Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph | May 23, 2025 | BenchmarkingManagement | CodeCode Available | 1 |
| REOBench: Benchmarking Robustness of Earth Observation Foundation Models | May 22, 2025 | BenchmarkingContrastive Learning | CodeCode Available | 1 |
| EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios | May 22, 2025 | Benchmarking | CodeCode Available | 1 |
| AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | May 22, 2025 | BenchmarkingInstruction Following | CodeCode Available | 1 |
| Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | May 22, 2025 | BenchmarkingEvidence Selection | CodeCode Available | 1 |
| DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models | May 20, 2025 | BenchmarkingDiagnostic | CodeCode Available | 1 |
| TxPert: Leveraging Biochemical Relationships for Out-of-Distribution Transcriptomic Perturbation Prediction | May 20, 2025 | BenchmarkingKnowledge Graphs | CodeCode Available | 1 |
| Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities | May 19, 2025 | Automated Theorem ProvingBenchmarking | CodeCode Available | 1 |
| TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents | May 19, 2025 | AI AgentBenchmarking | CodeCode Available | 1 |
| Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | May 19, 2025 | BenchmarkingChatbot | CodeCode Available | 1 |
| MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks | May 18, 2025 | BenchmarkingMedical Visual Question Answering | CodeCode Available | 1 |
| What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization | May 18, 2025 | Benchmarking | CodeCode Available | 1 |
| LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation | May 17, 2025 | BenchmarkingQuestion Answering | CodeCode Available | 1 |
| Massive-STEPS: Massive Semantic Trajectories for Understanding POI Check-ins -- Dataset and Benchmarks | May 16, 2025 | Benchmarking | CodeCode Available | 1 |