Audio captioning
Audio captioning is the task of describing an audio clip in natural-language text. The usual approach pairs an audio encoder (for example, PANN or CAV-MAE) with a text decoder (for example, a transformer) that generates the caption. Caption quality is commonly scored with machine translation and summarization metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), but these are not well suited to audio captions. Metrics based on pretrained language models, such as Sentence-BERT, have been proposed as alternatives.
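One reason n-gram overlap metrics fit audio captions poorly can be illustrated with a toy example. The sketch below is a simplified clipped n-gram precision (the core idea behind BLEU, without the brevity penalty or multi-reference handling), not any official metric implementation. The function names and the example captions are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" used by BLEU).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical captions, invented for this illustration.
reference = "a dog barks while cars pass by"
paraphrase = "a puppy is barking near traffic"      # same sound event, different words
wrong_event = "a dog sleeps while cars pass by"     # different sound event, similar words

paraphrase_score = clipped_precision(paraphrase, reference)
wrong_event_score = clipped_precision(wrong_event, reference)
print(f"paraphrase:  {paraphrase_score:.2f}")
print(f"wrong event: {wrong_event_score:.2f}")
```

In this toy case the caption that describes the wrong sound event scores far higher than the faithful paraphrase, because only surface word overlap is counted. Embedding-based metrics such as those built on Sentence-BERT aim to reduce this effect by comparing meaning rather than exact tokens.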
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VAST | CIDEr | 0.78 | — | Unverified |
| 2 | VALOR | CIDEr | 0.74 | — | Unverified |
| 3 | MQ-Cap | SPIDEr | 0.52 | — | Unverified |
| 4 | SLAM-AAC | SPIDEr | 0.52 | — | Unverified |
| 5 | LAVCap | SPIDEr | 0.52 | — | Unverified |
| 6 | EnCLAP++-large | SPIDEr | 0.51 | — | Unverified |
| 7 | AutoCap | SPIDEr | 0.51 | — | Unverified |
| 8 | LOAE | SPIDEr | 0.51 | — | Unverified |
| 9 | EnCLAP++-base | SPIDEr | 0.5 | — | Unverified |
| 10 | EnCLAP-large | SPIDEr | 0.5 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VAST | CIDEr | 0.52 | — | Unverified |
| 2 | VALOR | CIDEr | 0.42 | — | Unverified |
| 3 | SLAM-AAC | SPIDEr | 0.33 | — | Unverified |
| 4 | LOAE | SPIDEr | 0.33 | — | Unverified |
| 5 | MQ-Cap | SPIDEr | 0.32 | — | Unverified |
| 6 | Ensemble | SPIDEr | 0.32 | — | Unverified |
| 7 | Audio Flamingo (Pengi trainset) | SPIDEr | 0.31 | — | Unverified |
| 8 | Ensemble-RL | SPIDEr | 0.3 | — | Unverified |
| 9 | Qwen-Audio | SPIDEr | 0.29 | — | Unverified |
| 10 | Ensemble | SPIDEr | 0.21 | — | Unverified |
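Most rows in these tables report SPIDEr, which is defined as the arithmetic mean of a caption's SPICE and CIDEr scores. A minimal sketch of that combination follows. The function name and the input values are hypothetical and are not taken from any row above. Computing SPICE and CIDEr themselves requires a full evaluation toolkit and is not shown.

```python
def spider(spice: float, cider: float) -> float:
    """Combine SPICE and CIDEr into SPIDEr as their arithmetic mean."""
    return (spice + cider) / 2.0


# Made-up scores, used only to show the arithmetic.
example_spice = 0.2
example_cider = 0.8
example_spider = spider(example_spice, example_cider)
print(f"SPIDEr = {example_spider:.2f}")
```

Because SPIDEr averages in the SPICE score, a CIDEr value and a SPIDEr value for the same system generally differ, so rows that report different metrics are not directly comparable.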