SOTAVerified

Retrieval-augmented few-shot in-context audio captioning is a specialized approach within the broader domain of audio captioning. It applies few-shot in-context learning, as used in large language models (LLMs), to generate textual descriptions of audio content without training the model on the target dataset. Instead, at inference time, a retrieval step selects a few examples from the training data and presents them to the model in-context. The model then conditions on these examples to produce an accurate, contextually relevant caption for the new audio clip.
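The inference-time retrieval step described above can be sketched as follows. This is a minimal illustration, not the method of any particular paper: it assumes each training clip already has a fixed-length embedding, that relevance is measured by cosine similarity, and that the selected examples are laid out in a simple text prompt. The names `retrieve_examples` and `build_prompt`, the toy embeddings, and the captions are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_examples(query_emb, store, k=2):
    """Return the k training items most similar to the query embedding."""
    ranked = sorted(
        store,
        key=lambda item: cosine(query_emb, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(examples, query_placeholder="<query audio>"):
    """Lay out the retrieved (audio, caption) pairs, then the uncaptioned query."""
    lines = []
    for i, ex in enumerate(examples, 1):
        lines.append(f"Example {i} audio: {ex['audio_id']}")
        lines.append(f"Example {i} caption: {ex['caption']}")
    lines.append(f"Audio: {query_placeholder}")
    lines.append("Caption:")
    return "\n".join(lines)

# Toy training store (hypothetical data; real embeddings would come
# from an audio encoder, which is assumed and not shown here).
store = [
    {"audio_id": "train_001", "embedding": [0.9, 0.1, 0.0],
     "caption": "A dog barks repeatedly."},
    {"audio_id": "train_002", "embedding": [0.0, 1.0, 0.1],
     "caption": "Rain falls on a roof."},
    {"audio_id": "train_003", "embedding": [0.8, 0.2, 0.1],
     "caption": "A dog growls and then barks."},
]

query = [1.0, 0.0, 0.0]  # hypothetical embedding of the clip to caption
shots = retrieve_examples(query, store, k=2)
prompt = build_prompt(shots)
print(prompt)
```

In a real system the prompt would carry the audio itself (or its features) rather than an identifier, and it would be passed to a captioning model that is left unchanged; only the choice of in-context examples varies per query.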

Retrieval-augmented Few-shot In-context Audio Captioning

Papers