| BSBench: will your LLM find the largest prime number? | Jun 5, 2025 | Benchmarking | Code Available | 0 |
| Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation | Jun 5, 2025 | Benchmarking | Code Available | 0 |
| Design of intelligent proofreading system for English translation based on CNN and BERT | Jun 5, 2025 | Benchmarking, Machine Translation | —Unverified | 0 |
| DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models | Jun 5, 2025 | Benchmarking, Diversity | —Unverified | 0 |
| VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos | Jun 5, 2025 | Benchmarking, Mathematical Reasoning | —Unverified | 0 |
| MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories | Jun 5, 2025 | Benchmarking, Optical Character Recognition | Code Available | 2 |
| HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model | Jun 5, 2025 | Benchmarking, Language Modeling | —Unverified | 0 |
| From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems | Jun 5, 2025 | Benchmarking, RAG | —Unverified | 0 |
| CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx | Jun 5, 2025 | 2D Pose Estimation, Benchmarking | —Unverified | 0 |
| A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values | Jun 5, 2025 | Benchmarking | —Unverified | 0 |
| Urania: Differentially Private Insights into AI Use | Jun 5, 2025 | Benchmarking, Chatbot | —Unverified | 0 |
| AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | Jun 5, 2025 | Benchmarking, Video Understanding | —Unverified | 0 |
| Seeing in the Dark: Benchmarking Egocentric 3D Vision with the Oxford Day-and-Night Dataset | Jun 4, 2025 | 3D geometry, Benchmarking | —Unverified | 0 |
| Knowledge-guided Contextual Gene Set Analysis Using Large Language Models | Jun 4, 2025 | Benchmarking | —Unverified | 0 |
| MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP | Jun 4, 2025 | Benchmarking, Language Modelling | —Unverified | 0 |
| HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models | Jun 4, 2025 | Benchmarking, General Knowledge | Code Available | 0 |
| MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale | Jun 4, 2025 | Benchmarking, Language Modeling | —Unverified | 0 |
| macOSWorld: A Multilingual Interactive Benchmark for GUI Agents | Jun 4, 2025 | Benchmarking, Domain Adaptation | Code Available | 1 |
| A Kernel-Based Approach for Accurate Steady-State Detection in Performance Time Series | Jun 4, 2025 | Benchmarking, Irregular Time Series | Code Available | 0 |
| AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance | Jun 4, 2025 | Benchmarking, Scheduling | Code Available | 5 |
| Curse of Slicing: Why Sliced Mutual Information is a Deceptive Measure of Statistical Dependence | Jun 4, 2025 | Benchmarking | —Unverified | 0 |
| Generating Automotive Code: Large Language Models for Software Development and Verification in Safety-Critical Systems | Jun 4, 2025 | Benchmarking, Code Generation | —Unverified | 0 |
| CETBench: A Novel Dataset constructed via Transformations over Programs for Benchmarking LLMs for Code-Equivalence Checking | Jun 4, 2025 | Benchmarking, Code Generation | —Unverified | 0 |
| N^2: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion | Jun 4, 2025 | Benchmarking, Causal Inference | Code Available | 0 |
| ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions | Jun 3, 2025 | Benchmarking, Diversity | Code Available | 1 |
| AMLgentex: Mobilizing Data-Driven Research to Combat Money Laundering | Jun 3, 2025 | Benchmarking | —Unverified | 0 |
| FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes | Jun 3, 2025 | Benchmarking, Feature Engineering | Code Available | 0 |
| Tactile MNIST: Benchmarking Active Tactile Perception | Jun 3, 2025 | Benchmarking, Scene Understanding | —Unverified | 0 |
| FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models | Jun 3, 2025 | Benchmarking, Domain Adaptation | —Unverified | 0 |
| SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation | Jun 3, 2025 | Benchmarking, Style Transfer | —Unverified | 0 |
| NetPress: Dynamically Generated LLM Benchmarks for Network Applications | Jun 3, 2025 | Benchmarking | Code Available | 1 |
| Rethinking Machine Unlearning in Image Generation Models | Jun 3, 2025 | Benchmarking, Image Generation | Code Available | 1 |
| FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents | Jun 2, 2025 | Benchmarking, Form | —Unverified | 0 |
| CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models | Jun 2, 2025 | Benchmarking | Code Available | 0 |
| TIIF-Bench: How Does Your T2I Model Follow Your Instructions? | Jun 2, 2025 | Benchmarking, Instruction Following | —Unverified | 0 |
| Benchmarking Neural Speech Codec Intelligibility with SITool | Jun 2, 2025 | Benchmarking, Diagnostic | —Unverified | 0 |
| ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code | Jun 2, 2025 | Benchmarking, Code Generation | —Unverified | 0 |
| ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists | Jun 2, 2025 | Benchmarking, Form | —Unverified | 0 |
| GSCodec Studio: A Modular Framework for Gaussian Splat Compression | Jun 2, 2025 | Benchmarking | Code Available | 2 |
| Greening AI-enabled Systems with Software Engineering: A Research Agenda for Environmentally Sustainable AI Practices | Jun 2, 2025 | Benchmarking | —Unverified | 0 |
| ModuLM: Enabling Modular and Multimodal Molecular Relational Learning with Large Language Models | Jun 1, 2025 | Benchmarking, Relational Reasoning | —Unverified | 0 |
| CODEMENV: Benchmarking Large Language Models on Code Migration | Jun 1, 2025 | Benchmarking | Code Available | 1 |
| ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness | Jun 1, 2025 | Benchmarking, Management | Code Available | 0 |
| MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book | Jun 1, 2025 | Benchmarking | Code Available | 0 |
| The iNaturalist Sounds Dataset | May 31, 2025 | Benchmarking | —Unverified | 0 |
| AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time | May 31, 2025 | Benchmarking, Test-time Adaptation | Code Available | 1 |
| Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents | May 30, 2025 | Benchmarking, Code Repair | —Unverified | 0 |
| PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset | May 30, 2025 | Benchmarking, Multiple Instance Learning | Code Available | 0 |
| Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework | May 30, 2025 | Benchmarking | Code Available | 0 |
| GenSpace: Benchmarking Spatially-Aware Image Generation | May 30, 2025 | Benchmarking, Image Generation | —Unverified | 0 |