| Qwen2.5-Coder Technical Report | Sep 18, 2024 | Code Generation | Code Available | 11 |
| Better Synthetic Data by Retrieving and Transforming Existing Datasets | Apr 22, 2024 | Dataset Generation, Diversity | Code Available | 7 |
| LAB: Large-Scale Alignment for ChatBots | Mar 2, 2024 | Instruction Following, Language Modeling | Code Available | 5 |
| DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows | Feb 16, 2024 | Synthetic Data Generation | Code Available | 5 |
| GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation | May 26, 2025 | Question Answering, Synthetic Data Generation | Code Available | 4 |
| TabularARGN: A Flexible and Efficient Auto-Regressive Framework for Generating High-Fidelity Synthetic Data | Jan 21, 2025 | Fairness, Imputation | Code Available | 4 |
| Nemotron-4 340B Technical Report | Jun 17, 2024 | Synthetic Data Generation | Code Available | 4 |
| MegActor: Harness the Power of Raw Video for Vivid Portrait Animation | May 31, 2024 | Portrait Animation, Style Transfer | Code Available | 4 |
| TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models | May 18, 2023 | Natural Language Inference, Synthetic Data Generation | Code Available | 4 |
| FSID: Fully Synthetic Image Denoising via Procedural Scene Generation | Dec 7, 2022 | Denoising, Image Denoising | Code Available | 4 |
| Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models | Jun 10, 2025 | 3D Lane Detection, 3D Object Detection | Code Available | 3 |
| ReasonIR: Training Retrievers for Reasoning Tasks | Apr 29, 2025 | Information Retrieval, MMLU | Code Available | 3 |
| Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs | Apr 28, 2025 | Synthetic Data Generation | Code Available | 3 |
| OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis | Dec 27, 2024 | Diversity, Synthetic Data Generation | Code Available | 3 |
| A Survey on Deep Learning for Theorem Proving | Apr 15, 2024 | Automated Theorem Proving, Deep Learning | Code Available | 3 |
| SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis | Jun 12, 2025 | Benchmarking, Dialogue Generation | Code Available | 2 |
| Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs | May 12, 2025 | AI Agent, Knowledge Distillation | Code Available | 2 |
| SVAD: From Single Image to 3D Avatar via Synthetic Data Generation with Video Diffusion and Data Augmentation | May 8, 2025 | 3DGS, Data Augmentation | Code Available | 2 |
| Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework | Apr 2, 2025 | Benchmarking, Synthetic Data Generation | Code Available | 2 |
| Mellow: a small audio language model for reasoning | Mar 11, 2025 | Audio captioning, Language Modeling | Code Available | 2 |
| Improved Multi-Task Brain Tumour Segmentation with Synthetic Data Augmentation | Nov 7, 2024 | Data Augmentation, Synthetic Data Generation | Code Available | 2 |
| Efficient LLM Scheduling by Learning to Rank | Aug 28, 2024 | Blocking, Chatbot | Code Available | 2 |
| VGGHeads: 3D Multi Head Alignment with a Large-Scale Synthetic Dataset | Jul 25, 2024 | Head Detection, Keypoint Estimation | Code Available | 2 |
| UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models | Jun 27, 2024 | Attribute, Benchmarking | Code Available | 2 |
| SynRS3D: A Synthetic Dataset for Global 3D Semantic Understanding from Monocular Remote Sensing Imagery | Jun 26, 2024 | Domain Adaptation, Earth Observation | Code Available | 2 |
| A Synthetic Dataset for Personal Attribute Inference | Jun 11, 2024 | Attribute, Author Profiling | Code Available | 2 |
| End-to-End Full-Page Optical Music Recognition for Pianoform Sheet Music | May 20, 2024 | Synthetic Data Generation | Code Available | 2 |
| Pedagogical Alignment of Large Language Models | Feb 7, 2024 | Synthetic Data Generation | Code Available | 2 |
| UAVD4L: A Large-Scale Dataset for UAV 6-DoF Localization | Jan 11, 2024 | Synthetic Data Generation, Visual Localization | Code Available | 2 |
| Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting | Jul 21, 2023 | Imputation, Probabilistic Time Series Forecasting | Code Available | 2 |
| Improving 2D Human Pose Estimation in Rare Camera Views with Synthetic Data | Jul 13, 2023 | 2D Human Pose Estimation, Pose Estimation | Code Available | 2 |
| InPars Toolkit: A Unified and Reproducible Synthetic Data Generation Pipeline for Neural Information Retrieval | Jul 10, 2023 | GPU, Information Retrieval | Code Available | 2 |
| BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion | Jun 29, 2023 | Synthetic Data Generation | Code Available | 2 |
| TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series | May 19, 2023 | Diversity, Synthetic Data Generation | Code Available | 2 |
| Towards Realistic Generative 3D Face Models | Apr 24, 2023 | 3D Face Reconstruction, Face Model | Code Available | 2 |
| REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers | Feb 4, 2023 | Synthetic Data Generation | Code Available | 2 |
| DigiFace-1M: 1 Million Digital Face Images for Face Recognition | Oct 5, 2022 | Attribute, Face Recognition | Code Available | 2 |
| Synthetic QA Corpora Generation with Roundtrip Consistency | Jun 12, 2019 | Question Answering, Question Generation | Code Available | 2 |
| Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability | Jun 2, 2025 | Descriptive, Synthetic Data Generation | Code Available | 1 |
| dpmm: Differentially Private Marginal Models, a Library for Synthetic Tabular Data Generation | May 31, 2025 | Synthetic Data Generation, Tabular Data Generation | Code Available | 1 |
| Analysis and Evaluation of Synthetic Data Generation in Speech Dysfluency Detection | May 28, 2025 | Diversity, Synthetic Data Generation | Code Available | 1 |
| ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval | May 27, 2025 | Image Retrieval, Retrieval | Code Available | 1 |
| V2V: Scaling Event-Based Vision through Efficient Video-to-Voxel Simulation | May 22, 2025 | Event-based vision, Optical Flow Estimation | Code Available | 1 |
| BLEUBERI: BLEU is a surprisingly effective reward for instruction following | May 16, 2025 | Instruction Following, Synthetic Data Generation | Code Available | 1 |
| RAGSynth: Synthetic Data for Robust and Faithful RAG Component Optimization | May 16, 2025 | RAG, Synthetic Data Generation | Code Available | 1 |
| Towards Ball Spin and Trajectory Analysis in Table Tennis Broadcast Videos via Physically Grounded Synthetic-to-Real Transfer | Apr 28, 2025 | Monocular 3D Object Localization, Sports Analytics | Code Available | 1 |
| MEDIBENG WHISPER TINY: A FINE-TUNED CODE-SWITCHED BENGALI-ENGLISH TRANSLATOR FOR CLINICAL APPLICATIONS | Apr 25, 2025 | Clinical Language Translation, Machine Translation | Code Available | 1 |
| A Comprehensive Survey of Synthetic Tabular Data Generation | Apr 23, 2025 | Privacy Preserving, Survey | Code Available | 1 |
| GLiNER-BioMed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition | Apr 1, 2025 | Computational Efficiency, named-entity-recognition | Code Available | 1 |
| Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving | Mar 23, 2025 | 3DGS, Autonomous Driving | Code Available | 1 |