| AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | Nov 12, 2022 | Contrastive Learning, Cross-Modal Retrieval | Code Available | 4 |
| Your Diffusion Model is Secretly a Zero-Shot Classifier | Mar 28, 2023 | Domain Generalization, Fine-Grained Image Classification | Code Available | 2 |
| Learning Transferable Visual Models From Natural Language Supervision | Feb 26, 2021 | Action Recognition, Benchmarking | Code Available | 2 |
| Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Feb 11, 2021 | Cross-Modal Retrieval, Fine-Grained Image Classification | Code Available | 2 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | May 4, 2022 | Action Classification, Decoder | Code Available | 1 |
| Distilling Large Vision-Language Model with Out-of-Distribution Generalizability | Jul 6, 2023 | Few-Shot Image Classification, Image Classification | Code Available | 1 |
| EVA-CLIP: Improved Training Techniques for CLIP at Scale | Mar 27, 2023 | Image Classification, Representation Learning | Code Available | 1 |
| Florence: A New Foundation Model for Computer Vision | Nov 22, 2021 | Action Classification, Action Recognition | Code Available | 1 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Dec 21, 2023 | Image Retrieval, Image-to-Text Retrieval | Code Available | 1 |
| Learning Customized Visual Models with Retrieval-Augmented Knowledge | Jan 17, 2023 | Contrastive Learning, Retrieval | Code Available | 1 |
| The effectiveness of MAE pre-pretraining for billion-scale pretraining | Mar 23, 2023 | Action Classification, Action Recognition | Code Available | 1 |
| LiT: Zero-Shot Transfer with Locked-image text Tuning | Nov 15, 2021 | image-classification, Image Classification | Code Available | 1 |
| Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception | May 10, 2023 | Classification, image-classification | Unverified | 0 |
| Combined Scaling for Zero-shot Transfer Learning | Nov 19, 2021 | Classification, Contrastive Learning | Unverified | 0 |
| Learning Visual N-Grams from Web Data | Dec 29, 2016 | Language Modeling, Language Modelling | Unverified | 0 |
| M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining | Jan 29, 2024 | GPU, zero-shot-classification | Code Available | 0 |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | Sep 14, 2022 | Decoder, Few-Shot Image Classification | Code Available | 0 |
| EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters | Feb 6, 2024 | image-classification, Image Classification | Code Available | 0 |
| Scaling Vision Transformers to 22 Billion Parameters | Feb 10, 2023 | Action Classification, Fairness | Code Available | 0 |