| Multimodal Whole Slide Foundation Model for Pathology | Nov 29, 2024 | Cross-Modal Retrievalmodel | CodeCode Available | 4 | 5 |
| Multi-label Cluster Discrimination for Visual Representation Learning | Jul 24, 2024 | Contrastive LearningImage-text Retrieval | CodeCode Available | 4 | 5 |
| Long-CLIP: Unlocking the Long-Text Capability of CLIP | Mar 22, 2024 | Image GenerationImage Retrieval | CodeCode Available | 4 | 5 |
| A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges | Jan 4, 2025 | FairnessHallucination | CodeCode Available | 4 | 5 |
| FG-CLIP: Fine-Grained Visual and Textual Alignment | May 8, 2025 | Image-text Retrievalobject-detection | CodeCode Available | 4 | 5 |
| LLM-Pruner: On the Structural Pruning of Large Language Models | May 19, 2023 | Text Generationzero-shot-classification | CodeCode Available | 3 | 5 |
| CorrCLIP: Reconstructing Correlations in CLIP with Off-the-Shelf Foundation Models for Open-Vocabulary Semantic Segmentation | Nov 15, 2024 | Open Vocabulary Semantic SegmentationOpen-Vocabulary Semantic Segmentation | CodeCode Available | 2 | 5 |
| TabLLM: Few-shot Classification of Tabular Data with Large Language Models | Oct 19, 2022 | ClassificationDeep Learning | CodeCode Available | 2 | 5 |
| VeCLIP: Improving CLIP Training via Visual-enriched Captions | Oct 11, 2023 | Image-text RetrievalRetrieval | CodeCode Available | 2 | 5 |
| CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation | Apr 30, 2024 | MambaState Space Models | CodeCode Available | 2 | 5 |
| Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification | Sep 1, 2024 | Scene ClassificationTransductive Zero-Shot Classification | CodeCode Available | 2 | 5 |
| RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing | Jun 20, 2023 | Cross-Modal RetrievalImage Retrieval | CodeCode Available | 2 | 5 |
| DiffCLIP: Differential Attention Meets CLIP | Mar 9, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 2 | 5 |
| GeoVision Labeler: Zero-Shot Geospatial Classification with Vision and Language Models | May 30, 2025 | ClassificationDisaster Response | CodeCode Available | 2 | 5 |
| RWKV-CLIP: A Robust Vision-Language Representation Learner | Jun 11, 2024 | Image-text RetrievalRepresentation Learning | CodeCode Available | 2 | 5 |
| Uni3D: Exploring Unified 3D Representation at Scale | Oct 10, 2023 | 3D Object ClassificationRetrieval | CodeCode Available | 2 | 5 |
| Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP | Jun 25, 2024 | cross-modal alignmentImage Classification | CodeCode Available | 2 | 5 |
| Boosting Vision-Language Models for Histopathology Classification: Predict all at once | Sep 3, 2024 | Allzero-shot-classification | CodeCode Available | 2 | 5 |
| Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning | May 31, 2023 | Decision MakingGeneral Knowledge | CodeCode Available | 2 | 5 |
| ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding | May 14, 2023 | 3D Classification3D Point Cloud Classification | CodeCode Available | 2 | 5 |
| ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation | Dec 7, 2022 | Semantic Segmentationzero-shot-classification | CodeCode Available | 2 | 5 |
| Zero-Shot ECG Classification with Multimodal Learning and Test-time Clinical Knowledge Enhancement | Mar 11, 2024 | Clinical KnowledgeDescriptive | CodeCode Available | 2 | 5 |
| RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | Jun 19, 2023 | ClassificationCross-Modal Retrieval | CodeCode Available | 2 | 5 |
| CARZero: Cross-Attention Alignment for Radiology Zero-Shot Classification | Feb 27, 2024 | ClassificationDiagnostic | CodeCode Available | 2 | 5 |
| Large-scale and Fine-grained Vision-language Pre-training for Enhanced CT Image Understanding | Jan 24, 2025 | AnatomyContrastive Learning | CodeCode Available | 2 | 5 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | May 16, 2025 | Cross-Modal RetrievalDiagnostic | CodeCode Available | 2 | 5 |
| BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature | Jan 13, 2025 | ArticlesImage-text Retrieval | CodeCode Available | 2 | 5 |
| Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models | Feb 19, 2024 | Adversarial DefenseMultimodal Deep Learning | CodeCode Available | 2 | 5 |
| Your Diffusion Model is Secretly a Zero-Shot Classifier | Mar 28, 2023 | Domain GeneralizationFine-Grained Image Classification | CodeCode Available | 2 | 5 |
| Advancing Medical Representation Learning Through High-Quality Data | Mar 18, 2025 | Representation Learningzero-shot-classification | CodeCode Available | 1 | 5 |
| ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models | Oct 27, 2023 | Column Type AnnotationTable annotation | CodeCode Available | 1 | 5 |
| CountCLIP -- [Re] Teaching CLIP to Count to Ten | Jun 5, 2024 | zero-shot-classificationZero-Shot Counting | CodeCode Available | 1 | 5 |
| Controlling Latent Diffusion Using Latent CLIP | Mar 11, 2025 | DenoisingDescriptive | CodeCode Available | 1 | 5 |
| Contrastive Language-Image Pre-training for the Italian Language | Aug 19, 2021 | Image RetrievalMulti-label zero-shot learning | CodeCode Available | 1 | 5 |
| Florence: A New Foundation Model for Computer Vision | Nov 22, 2021 | Action ClassificationAction Recognition | CodeCode Available | 1 | 5 |
| CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation | Mar 19, 2024 | DecoderInstance Segmentation | CodeCode Available | 1 | 5 |
| CLIPure: Purification in Latent Space via CLIP for Adversarially Robust Zero-Shot Classification | Feb 25, 2025 | Denoisingzero-shot-classification | CodeCode Available | 1 | 5 |
| Exploring Vision-Language Models for Imbalanced Learning | Apr 4, 2023 | Decoderzero-shot-classification | CodeCode Available | 1 | 5 |
| CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections | Nov 28, 2024 | image-classificationImage Classification | CodeCode Available | 1 | 5 |
| CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision | Dec 14, 2021 | Contrastive LearningRepresentation Learning | CodeCode Available | 1 | 5 |
| CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation | Feb 27, 2025 | Image-text matchingObject | CodeCode Available | 1 | 5 |
| CyCLIP: Cyclic Contrastive Language-Image Pretraining | May 28, 2022 | Representation LearningVisual Reasoning | CodeCode Available | 1 | 5 |
| From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection | May 19, 2025 | feature selectionOut-of-Distribution Generalization | CodeCode Available | 1 | 5 |
| Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models | Jun 10, 2025 | Contrastive LearningImage-text matching | CodeCode Available | 1 | 5 |
| DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection | Oct 2, 2023 | Novel Object DetectionObject | CodeCode Available | 1 | 5 |
| EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition | Oct 25, 2023 | Facial Expression RecognitionFacial Expression Recognition (FER) | CodeCode Available | 1 | 5 |
| Discovering Human Interactions With Novel Objects via Zero-Shot Learning | Jun 1, 2020 | Human-Object Interaction DetectionObject | CodeCode Available | 1 | 5 |
| CLIP-Guided Source-Free Object Detection in Aerial Images | Jan 10, 2024 | Domain AdaptationObject | CodeCode Available | 1 | 5 |
| CLIPArTT: Adaptation of CLIP to New Domains at Test Time | May 1, 2024 | Pseudo LabelTest-time Adaptation | CodeCode Available | 1 | 5 |
| Discriminative Region-based Multi-Label Zero-Shot Learning | Aug 20, 2021 | Image RetrievalMulti-label zero-shot learning | CodeCode Available | 1 | 5 |