| DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports and Patient-Specific Information | Jan 10, 2025 | Benchmarking, Data Augmentation | Code Available | 1 | 5 |
| DexArt: Benchmarking Generalizable Dexterous Manipulation with Articulated Objects | May 9, 2023 | Benchmarking, Decision Making | Code Available | 1 | 5 |
| Detecting beats in the photoplethysmogram: benchmarking open-source algorithms | Jul 19, 2022 | Benchmarking, Photoplethysmography (PPG) beat detection | Code Available | 1 | 5 |
| DFGC 2021: A DeepFake Game Competition | Jun 2, 2021 | Benchmarking, DeepFake Detection | Code Available | 1 | 5 |
| DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity | Aug 11, 2023 | Benchmarking, Diversity | Code Available | 1 | 5 |
| Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Jun 5, 2023 | Benchmarking, Multiple-choice | Code Available | 1 | 5 |
| A Computed Tomography Vertebral Segmentation Dataset with Anatomical Variations and Multi-Vendor Scanner Data | Mar 10, 2021 | Anatomy, Benchmarking | Code Available | 1 | 5 |
| Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers | Jul 3, 2020 | Benchmarking, Deep Learning | Code Available | 1 | 5 |
| Benchmarking Large Language Models for News Summarization | Jan 31, 2023 | Benchmarking, News Summarization | Code Available | 1 | 5 |
| Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards | May 7, 2025 | Benchmarking, Hallucination | Code Available | 1 | 5 |
| Benchmarking Micro-action Recognition: Dataset, Methods, and Applications | Mar 8, 2024 | Action Recognition, Benchmarking | Code Available | 1 | 5 |
| Descending through a Crowded Valley — Benchmarking Deep Learning Optimizers | Jan 1, 2021 | Benchmarking, Deep Learning | Code Available | 1 | 5 |
| Digital Typhoon: Long-term Satellite Image Dataset for the Spatio-Temporal Modeling of Tropical Cyclones | Nov 5, 2023 | Benchmarking | Code Available | 1 | 5 |
| Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations? | Apr 29, 2024 | Answer Generation, Benchmarking | Code Available | 1 | 5 |
| Benchmarking Language Models for Code Syntax Understanding | Oct 26, 2022 | Benchmarking | Code Available | 1 | 5 |
| AudioMarkBench: Benchmarking Robustness of Audio Watermarking | Jun 11, 2024 | Benchmarking, text-to-speech | Code Available | 1 | 5 |
| Delving into Out-of-Distribution Detection with Medical Vision-Language Models | Mar 2, 2025 | Benchmarking, image-classification | Code Available | 1 | 5 |
| RobFR: Benchmarking Adversarial Robustness on Face Recognition | Jul 8, 2020 | Adversarial Robustness, Benchmarking | Code Available | 1 | 5 |
| Benchmarking Large Language Models for Automated Verilog RTL Code Generation | Dec 13, 2022 | Benchmarking, Code Generation | Code Available | 1 | 5 |
| A Large-Scale Dataset for Benchmarking Elevator Button Segmentation and Character Recognition | Mar 16, 2021 | Benchmarking, Position | Code Available | 1 | 5 |
| Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Jan 1, 2024 | Benchmarking, Instruction Following | Code Available | 1 | 5 |
| Benchmarking Large Multimodal Models against Common Corruptions | Jan 22, 2024 | Benchmarking, Image to text | Code Available | 1 | 5 |
| DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 | Mar 20, 2023 | Benchmarking, De-identification | Code Available | 1 | 5 |
| Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Feb 18, 2024 | Benchmarking, Language Modeling | Code Available | 1 | 5 |
| A Large-scale Comprehensive Dataset and Copy-overlap Aware Evaluation Protocol for Segment-level Video Copy Detection | Mar 5, 2022 | Benchmarking, Copy Detection | Code Available | 1 | 5 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | Benchmarking, Multiple-choice | Code Available | 1 | 5 |
| A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games | Aug 25, 2021 | Benchmarking, Video Classification | Code Available | 1 | 5 |
| Benchmarking Language Model Creativity: A Case Study on Code Generation | Jul 12, 2024 | Benchmarking, Code Generation | Code Available | 1 | 5 |
| Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Dec 11, 2024 | Attribute, Benchmarking | Code Available | 1 | 5 |
| A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models | Apr 22, 2024 | Benchmarking, World Knowledge | Code Available | 1 | 5 |
| DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios | Oct 31, 2024 | Benchmarking, LLM-generated Text Detection | Code Available | 1 | 5 |
| Developing a Scalable Benchmark for Assessing Large Language Models in Knowledge Graph Engineering | Aug 31, 2023 | Benchmarking, Dataset Generation | Code Available | 1 | 5 |
| Deluca -- A Differentiable Control Library: Environments, Methods, and Benchmarking | Feb 19, 2021 | Benchmarking, OpenAI Gym | Code Available | 1 | 5 |
| Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations | Apr 15, 2024 | Benchmarking, Bias Detection | Code Available | 1 | 5 |
| DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models | May 20, 2025 | Benchmarking, Diagnostic | Code Available | 1 | 5 |
| AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models | Jun 24, 2024 | Benchmarking, Data Augmentation | Code Available | 1 | 5 |
| Demystifying Learning Rate Policies for High Accuracy Training of Deep Neural Networks | Aug 18, 2019 | Benchmarking, Image Classification | Code Available | 1 | 5 |
| Attention, Please! Revisiting Attentive Probing for Masked Image Modeling | Jun 11, 2025 | Benchmarking, Computational Efficiency | Code Available | 1 | 5 |
| Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM | Mar 28, 2024 | Benchmarking | Code Available | 1 | 5 |
| Benchmarking Meta-embeddings: What Works and What Does Not | Nov 1, 2021 | Benchmarking, Embeddings Evaluation | Code Available | 1 | 5 |
| Benchmarking LLMs' Swarm intelligence | May 7, 2025 | Benchmarking | Code Available | 1 | 5 |
| DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects | Oct 3, 2024 | Benchmarking, Imitation Learning | Code Available | 1 | 5 |
| Align and Distill: Unifying and Improving Domain Adaptive Object Detection | Mar 18, 2024 | Benchmarking, object-detection | Code Available | 1 | 5 |
| Deep learning model solves change point detection for multiple change types | Apr 15, 2022 | Benchmarking, Change Point Detection | Code Available | 1 | 5 |
| Deep Learning-Based Synchronization for Uplink NB-IoT | May 22, 2022 | Benchmarking, Deep Learning | Code Available | 1 | 5 |
| Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans | Jan 14, 2021 | Benchmarking, Medical Diagnosis | Code Available | 1 | 5 |
| Benchmarking Meaning Representations in Neural Semantic Parsing | Nov 1, 2020 | Benchmarking, Semantic Parsing | Code Available | 1 | 5 |
| DocuMint: Docstring Generation for Python using Small Language Models | May 16, 2024 | Benchmarking, Code Generation | Code Available | 1 | 5 |
| Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs | Feb 13, 2025 | Benchmarking, Retrieval | Code Available | 1 | 5 |
| A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking | Oct 14, 2022 | Benchmarking, GPU | Code Available | 1 | 5 |