| Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese | Nov 2, 2022 | Contrastive Learning; image-classification | Code Available | 5 | 5 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Jan 30, 2023 | Generative Visual Question Answering; Image Captioning | Code Available | 4 | 5 |
| Sigmoid Loss for Language Image Pre-Training | Mar 27, 2023 | Contrastive Learning; Disentanglement | Code Available | 3 | 5 |
| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | May 18, 2023 | 1 Image, 2*2 Stitchi; Action Classification | Code Available | 3 | 5 |
| MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions | Mar 28, 2024 | Image Retrieval; Implicit Relations | Code Available | 3 | 5 |
| Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment | Jan 1, 2024 | cross-modal alignment; Cross-Modal Retrieval | Code Available | 2 | 5 |
| Learning Transferable Visual Models From Natural Language Supervision | Feb 26, 2021 | Action Recognition; Benchmarking | Code Available | 2 | 5 |
| FLAVA: A Foundational Language And Vision Alignment Model | Dec 8, 2021 | Image Retrieval; Image-to-Text Retrieval | Code Available | 1 | 5 |
| ZSCRGAN: A GAN-based Expectation Maximization Model for Zero-Shot Retrieval of Images from Textual Descriptions | Jul 23, 2020 | Cross-Modal Information Retrieval; Image Retrieval | Code Available | 0 | 5 |
| M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining | Jan 29, 2024 | GPU; zero-shot-classification | Code Available | 0 | 5 |