| A Bi-directional Transformer for Musical Chord Recognition | Jul 5, 2019 | Chord Recognition, Descriptive | Code Available | 1 | 5 |
| FlexConv: Continuous Kernel Convolutions with Differentiable Kernel Sizes | Oct 15, 2021 | Descriptive, Image Classification | Code Available | 1 | 5 |
| InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | Jun 19, 2025 | Benchmarking, Descriptive | Code Available | 1 | 5 |
| From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | May 30, 2022 | counterfactual, Descriptive | Code Available | 1 | 5 |
| LaMOT: Language-Guided Multi-Object Tracking | Jun 12, 2024 | Descriptive, Multi-Object Tracking | Code Available | 1 | 5 |
| General audio tagging with ensembling convolutional neural network and statistical features | Oct 30, 2018 | Audio Tagging, Descriptive | Code Available | 1 | 5 |
| Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search | Feb 2, 2021 | Descriptive, Image Generation | Code Available | 1 | 5 |
| GL-RG: Global-Local Representation Granularity for Video Captioning | May 22, 2022 | Caption Generation, Descriptive | Code Available | 1 | 5 |
| Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training | Jan 4, 2024 | Descriptive, Image Captioning | Code Available | 1 | 5 |
| Text-Guided Neural Image Inpainting | Apr 7, 2020 | Descriptive, Image Generation | Code Available | 1 | 5 |