Dataset Generation

The task aims to improve the training of a target application (e.g., autonomous driving systems) by generating datasets of diverse and critical elements (e.g., traffic scenarios). Traditional methods rely on expensive, limited datasets that often fail to capture rare but essential situations that can pose risks during testing.

Papers


Showing 1–50 of 308 papers

Title	Date	Tasks	Status	Hype	Score
Better Synthetic Data by Retrieving and Transforming Existing Datasets	Apr 22, 2024	Dataset Generation, Diversity	Code Available	7	5
Synthetic Dataset Generation for Adversarial Machine Learning Research	Jul 21, 2022	BIG-bench Machine Learning, Dataset Generation	Code Available	6	5
Prompt2Model: Generating Deployable Models from Natural Language Instructions	Aug 23, 2023	Data-free Knowledge Distillation, Dataset Generation	Code Available	4	5
AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct	May 23, 2024	Class-level Code Generation, Code Completion	Code Available	4	5
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework	Aug 2, 2024	Benchmarking, Dataset Generation	Code Available	3	5
Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval	Jun 9, 2025	Dataset Generation, RAG	Code Available	3	5
An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation	Feb 26, 2024	Dataset Generation, text-to-speech	Code Available	2	5
UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models	Jun 27, 2024	Attribute, Benchmarking	Code Available	2	5
Vision Language Action Models in Robotic Manipulation: A Systematic Review	Jul 14, 2025	Dataset Generation, Natural Language Understanding	Code Available	2	5
JAX-SPH: A Differentiable Smoothed Particle Hydrodynamics Framework	Mar 7, 2024	Dataset Generation	Code Available	2	5
DataDream: Few-shot Guided Dataset Generation	Jul 15, 2024	Classification, Dataset Generation	Code Available	2	5
Physics Informed Distillation for Diffusion Models	Nov 13, 2024	Dataset Generation, Image Generation	Code Available	2	5
CellViT++: Energy-Efficient and Adaptive Cell Segmentation and Classification Using Foundation Models	Jan 9, 2025	Cell Segmentation, Dataset Generation	Code Available	2	5
DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models	Aug 11, 2023	Dataset Generation, Decoder	Code Available	2	5
MultiCorrupt: A Multi-Modal Robustness Dataset and Benchmark of LiDAR-Camera Fusion for 3D Object Detection	Feb 18, 2024	3D Object Detection, Dataset Generation	Code Available	2	5
PEGASUS: Physically Enhanced Gaussian Splatting Simulation System for 6DoF Object Pose Dataset Generation	Jan 4, 2024	Dataset Generation, Object	Code Available	1	5
Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis	Mar 11, 2025	All, Dataset Generation	Code Available	1	5
Perceptual Loss for Robust Unsupervised Homography Estimation	Apr 20, 2021	Dataset Generation, Homography Estimation	Code Available	1	5
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning	Jun 5, 2025	Dataset Generation, Mathematical Problem-Solving	Code Available	1	5
Developing a Scalable Benchmark for Assessing Large Language Models in Knowledge Graph Engineering	Aug 31, 2023	Benchmarking, Dataset Generation	Code Available	1	5
DCFace: Synthetic Face Generation with Dual Condition Diffusion Model	Apr 14, 2023	Dataset Generation, Face Generation	Code Available	1	5
Faithful Persona-based Conversational Dataset Generation with Large Language Models	Dec 15, 2023	Chatbot, Dataset Generation	Code Available	1	5
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation	Sep 3, 2024	Dataset Generation, Question Answering	Code Available	1	5
Detecting Anti-Vaccine Users on Twitter	Oct 21, 2021	Dataset Generation, Misinformation	Code Available	1	5
MK-SQuIT: Synthesizing Questions using Iterative Template-filling	Nov 4, 2020	Dataset Generation, Machine Translation	Code Available	1	5
DiffuGen: Adaptable Approach for Generating Labeled Image Datasets using Stable Diffusion Models	Sep 1, 2023	Dataset Generation, Image Generation	Code Available	1	5
Afro-MNIST: Synthetic generation of MNIST-style datasets for low-resource languages	Sep 28, 2020	BIG-bench Machine Learning, Dataset Generation	Code Available	1	5
NeuroGraph: Benchmarks for Graph Machine Learning in Brain Connectomics	Jun 9, 2023	Benchmarking, Dataset Generation	Code Available	1	5
Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs	Sep 18, 2023	Dataset Generation, Question Answering	Code Available	1	5
PADetBench: Towards Benchmarking Physical Attacks against Object Detection	Aug 17, 2024	Adversarial Robustness, Benchmarking	Code Available	1	5
Learning-based NLOS Detection and Uncertainty Prediction of GNSS Observations with Transformer-Enhanced LSTM Network	Sep 1, 2023	Dataset Generation, State Estimation	Code Available	1	5
Learning to Answer Visual Questions from Web Videos	May 10, 2022	Dataset Generation, Question Answering	Code Available	1	5
Image Generation for Efficient Neural Network Training in Autonomous Drone Racing	Aug 6, 2020	Dataset Generation, Efficient Neural Network	Code Available	1	5
ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration	Mar 21, 2025	Dataset Generation, Point Cloud Registration	Code Available	1	5
Improving Paraphrase Detection with the Adversarial Paraphrasing Task	Jun 14, 2021	Dataset Generation, Paraphrase Identification	Code Available	1	5
LIQUID: A Framework for List Question Answering Dataset Generation	Feb 3, 2023	Dataset Generation, Question Answering	Code Available	1	5
Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design	May 29, 2024	Dataset Generation, Image to text	Code Available	1	5
Chip Placement with Diffusion Models	Jul 17, 2024	Dataset Generation, Denoising	Code Available	1	5
Automated Multi-level Preference for MLLMs	May 18, 2024	Dataset Generation, Hallucination	Code Available	1	5
CamDiff: Camouflage Image Augmentation via Diffusion Model	Apr 11, 2023	Dataset Generation, Image Augmentation	Code Available	1	5
HM3D-ABO: A Photo-realistic Dataset for Object-centric Multi-view 3D Reconstruction	Jun 24, 2022	3D Reconstruction, Camera Pose Estimation	Code Available	1	5
ICM-Assistant: Instruction-tuning Multimodal Large Language Models for Rule-based Explainable Image Content Moderation	Dec 24, 2024	Dataset Generation	Code Available	1	5
Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation	Sep 25, 2023	Dataset Generation, Segmentation	Code Available	1	5
Global Tensor Motion Planning	Nov 28, 2024	Dataset Generation, Diversity	Code Available	1	5
Bounding Box-Guided Diffusion for Synthesizing Industrial Images and Segmentation Map	May 6, 2025	Dataset Generation, Segmentation	Code Available	1	5
CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models	Jan 2, 2025	Benchmarking, Computer Security	Code Available	1	5
Actionet: An Interactive End-To-End Platform For Task-Based Data Collection And Augmentation In 3D Environment	Oct 3, 2020	Dataset Generation, Task Planning	Code Available	1	5
Generalizing Single-View 3D Shape Retrieval to Occlusions and Unseen Objects	Dec 31, 2023	3D Shape Retrieval, Dataset Generation	Code Available	1	5
OpenLS-DGF: An Adaptive Open-Source Dataset Generation Framework for Machine Learning Tasks in Logic Synthesis	Nov 14, 2024	Dataset Generation	Code Available	1	5
Forcing Diffuse Distributions out of Language Models	Apr 16, 2024	Dataset Generation, Diversity	Code Available	1	5



No leaderboard results yet.