| EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters | Feb 6, 2024 | image-classification, Image Classification | Code Available | 0 |
| M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining | Jan 29, 2024 | GPU, zero-shot-classification | Code Available | 0 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Dec 21, 2023 | Image Retrieval, Image-to-Text Retrieval | Code Available | 1 |
| Distilling Large Vision-Language Model with Out-of-Distribution Generalizability | Jul 6, 2023 | Few-Shot Image Classification, Image Classification | Code Available | 1 |
| Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception | May 10, 2023 | Classification, image-classification | Unverified | 0 |
| Your Diffusion Model is Secretly a Zero-Shot Classifier | Mar 28, 2023 | Domain Generalization, Fine-Grained Image Classification | Code Available | 2 |
| EVA-CLIP: Improved Training Techniques for CLIP at Scale | Mar 27, 2023 | Image Classification, Representation Learning | Code Available | 1 |
| The effectiveness of MAE pre-pretraining for billion-scale pretraining | Mar 23, 2023 | Action Classification, Action Recognition | Code Available | 1 |
| Scaling Vision Transformers to 22 Billion Parameters | Feb 10, 2023 | Action Classification, Fairness | Code Available | 0 |
| Learning Customized Visual Models with Retrieval-Augmented Knowledge | Jan 17, 2023 | Contrastive Learning, Retrieval | Code Available | 1 |