A Note on Statistically Accurate Tabular Data Generation Using Large Language Models

2025-05-05

Andrey Sidorenko

Code

github.com/mostly-ai/paper-datallm-materials
Official (linked in the paper)

Abstract

Large language models (LLMs) have shown promise in synthetic tabular data generation, yet existing methods struggle to preserve complex feature dependencies, particularly among categorical variables. This work introduces a probability-driven prompting approach that leverages LLMs to estimate conditional distributions, enabling more accurate and scalable data synthesis. The results highlight the potential of prompting probability distributions to enhance the statistical fidelity of LLM-generated tabular data.
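
The idea of prompting for conditional distributions, rather than for finished rows, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_estimate` is a hypothetical stand-in for an LLM call that returns a probability distribution over a categorical column given the values already sampled, and the column names and probabilities are invented for the example.

```python
import random
from typing import Callable, Dict, List

Distribution = Dict[str, float]
Row = Dict[str, str]


def normalize(dist: Distribution) -> Distribution:
    """Rescale non-negative weights so they sum to 1."""
    total = sum(dist.values())
    if total <= 0:
        raise ValueError("distribution must have positive total mass")
    return {value: weight / total for value, weight in dist.items()}


def sample_row(
    columns: List[str],
    estimate: Callable[[str, Row], Distribution],
    rng: random.Random,
) -> Row:
    """Sample one row column by column, conditioning on earlier columns."""
    row: Row = {}
    for column in columns:
        dist = normalize(estimate(column, row))
        values = list(dist)
        weights = [dist[v] for v in values]
        row[column] = rng.choices(values, weights=weights, k=1)[0]
    return row


def toy_estimate(column: str, context: Row) -> Distribution:
    """Hypothetical stand-in for an LLM that returns P(column | context).

    In a real pipeline this would build a prompt from `context`, ask the
    model for probabilities, and parse the response. The numbers below
    are made up for illustration only.
    """
    if column == "employment":
        return {"employed": 0.7, "student": 0.2, "retired": 0.1}
    if column == "age_group":
        if context["employment"] == "retired":
            return {"18-29": 0.0, "30-64": 0.1, "65+": 0.9}
        if context["employment"] == "student":
            return {"18-29": 0.85, "30-64": 0.15, "65+": 0.0}
        return {"18-29": 0.25, "30-64": 0.7, "65+": 0.05}
    raise KeyError(column)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows = [
    sample_row(["employment", "age_group"], toy_estimate, rng)
    for _ in range(1000)
]
```

Because each value is drawn from an explicit distribution conditioned on the columns sampled before it, dependencies between categorical features (here, employment status and age group) are set by the estimated probabilities rather than left to free-form text generation.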

Tasks

Tabular Data Generation
