SOTAVerified

Measuring diversity of synthetic prompts and data generated with fine-grained persona prompting

2025-05-23

Gauri Kambhatla, Chantal Shaib, Venkata Govindarajan


Abstract

Fine-grained personas have recently been used to generate 'diverse' synthetic data for pre-training and supervised fine-tuning of Large Language Models (LLMs). In this work, we measure the diversity of persona-driven synthetically generated prompts and responses with a suite of lexical diversity and redundancy metrics. First, we find that synthetic prompts/instructions are significantly less diverse than human-written ones. Next, we sample responses from LLMs of different sizes with fine-grained and coarse persona descriptions to investigate how much fine-grained detail in persona descriptions contributes to generated text diversity. We find that while persona-prompting does improve lexical diversity (especially with larger models), fine-grained detail in personas does not increase diversity noticeably.
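The abstract refers to a suite of lexical diversity metrics without naming them; a common metric in this family is distinct-n, the fraction of unique n-grams in a corpus. The sketch below is an illustrative implementation under that assumption (the paper's exact metric suite may differ), with hypothetical example prompts:

```python
from collections import Counter

def distinct_n(texts, n=2):
    """Fraction of unique n-grams across a corpus (higher = more lexically diverse).
    A standard diversity proxy; shown here only to illustrate the kind of metric used."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Hypothetical toy data: repetitive synthetic prompts vs. varied human-written ones
synthetic = ["write a story about a dragon", "write a story about a knight"]
human = ["compose a limerick", "debug my segfault", "explain quantum tunneling"]

print(distinct_n(synthetic))  # lower: prompts share most of their bigrams
print(distinct_n(human))      # higher: no bigram repeats across prompts
```

A corpus of near-duplicate synthetic prompts scores lower than a set of varied human prompts, which is the pattern the paper reports at scale.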
