Self-Foveate: Enhancing Diversity and Difficulty of Synthesized Instructions from Unsupervised Text via Multi-Level Foveation

2026-02-03Code Available0· sign in to hype

Mingzhe Li, Xin Lu, Yanyan Zhao

Code Available — Be the first to reproduce this paper.

Code

github.com/mubuky/self-foveate
OfficialIn paper★ 23

Abstract

Synthesizing high-quality instruction data from unsupervised text is a promising paradigm for training large language models (LLMs), yet automated methods for this task still exhibit significant limitations in the diversity and difficulty of synthesized instructions. To address these challenges, we propose Self-Foveate, an LLM-driven method for instruction synthesis. Inspired by hierarchical human visual perception, Self-Foveate introduces a "Micro-Scatter-Macro" multi-level foveation methodology that guides the extraction of textual information at three complementary granularities, from fine-grained details through cross-region connections to holistic patterns, thereby enhancing both the diversity and difficulty of synthesized instructions. Furthermore, a re-synthesis module is incorporated to improve the fidelity of instructions to source text and their overall quality. Comprehensive experiments across multiple unsupervised corpora and diverse model architectures demonstrate that Self-Foveate consistently outperforms existing methods. We publicly release our code at https://github.com/Mubuky/Self-Foveate

Self-Foveate: Enhancing Diversity and Difficulty of Synthesized Instructions from Unsupervised Text via Multi-Level Foveation

Code

Abstract

Reproductions