SOTAVerified

Zero-Shot Image Classification

Zero-shot image classification is a computer-vision task in which a model classifies images into categories it never saw during training. This is achieved by leveraging semantic information about the categories, such as textual descriptions of the class names or relationships between classes.
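
The dominant recipe today pairs a contrastively trained image encoder and text encoder (as in CLIP): each candidate category is rendered as a text prompt, and the image is assigned to the category whose prompt embedding is most similar to the image embedding. Below is a minimal sketch using the Hugging Face transformers CLIP API; the checkpoint, image path, and label set are illustrative, not tied to any paper listed here.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP-style model follows the same pattern.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")      # hypothetical input image
labels = ["cat", "dog", "airplane"]  # categories unseen as training labels

# Encode the image and one text prompt per candidate category, then rank
# categories by image-text similarity (CLIP scales similarities into logits).
inputs = processor(
    text=[f"a photo of a {c}" for c in labels],
    images=image,
    return_tensors="pt",
    padding=True,
)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # shape (1, num_labels)
print(dict(zip(labels, probs[0].tolist())))
```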

Papers

Showing 1–50 of 111 papers

Title | Status | Hype
CIBR: Cross-modal Information Bottleneck Regularization for Robust CLIP Generalization | | 0
LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text | Code | 1
Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation | | 0
Bayesian Test-Time Adaptation for Vision-Language Models | | 0
MADS: Multi-Attribute Document Supervision for Zero-Shot Image Classification | | 0
MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations | | 0
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion | Code | 2
KPL: Training-Free Medical Knowledge Mining of Vision-Language Models | Code | 0
Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation | | 0
Post-hoc Probabilistic Vision-Language Models | Code | 1
CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance | | 0
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives | | 0
TaxaBind: A Unified Embedding Space for Ecological Applications | Code | 1
Retrieval-enriched zero-shot image classification in low-resource domains | | 0
Multilingual Vision-Language Pre-training for the Remote Sensing Domain | Code | 0
Altogether: Image Captioning via Re-aligning Alt-text | Code | 0
Open-vocabulary vs. Closed-set: Best Practice for Few-shot Object Detection Considering Text Describability | Code | 0
Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge | Code | 1
CSA: Data-efficient Mapping of Unimodal Features to Multimodal Features | | 0
LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model | | 0
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling | Code | 2
DPA: Dual Prototypes Alignment for Unsupervised Adaptation of Vision-Language Models | Code | 0
Do Vision-Language Foundational models show Robust Visual Perception? | Code | 0
CoAPT: Context Attribute words for Prompt Tuning | | 0
Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion | Code | 0
Semantic Compositions Enhance Vision-Language Contrastive Learning | | 0
PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration | Code | 2
Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP | Code | 2
WATT: Weight Average Test-Time Adaptation of CLIP | Code | 2
BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models | | 0
Mind's Eye: Image Recognition by EEG via Multimodal Similarity-Keeping Contrastive Learning | Code | 1
Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships | | 0
It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap | | 0
What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models | Code | 0
Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp | Code | 0
Multi-method Integration with Confidence-based Weighting for Zero-shot Image Classification | | 0
MoDE: CLIP Data Experts via Clustering | Code | 0
A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene | | 0
Sparse Concept Bottleneck Models: Gumbel Tricks in Contrastive Learning | Code | 1
Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations | Code | 1
Bridge the Modality and Capability Gaps in Vision-Language Model Selection | | 0
Can We Talk Models Into Seeing the World Differently? | Code | 1
PromptKD: Unsupervised Prompt Distillation for Vision-Language Models | Code | 3
Exploring Low-Resource Medical Image Classification with Weakly Supervised Prompt Learning | | 0
Image-Caption Encoding for Improving Zero-Shot Generalization | Code | 0
Segment Any Change | Code | 0
CLAMP: Contrastive LAnguage Model Prompt-tuning | | 0
LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models | | 0
Towards Difficulty-Agnostic Efficient Transfer Learning for Vision-Language Models | Code | 0
Efficient Model-Agnostic Multi-Group Equivariant Networks | | 0

Benchmark Results

Each row below is the top-ranked entry of a separate leaderboard; none of the claimed scores has been independently verified yet.

# | Model | Metric | Claimed | Verified | Status
1 | OpenCLIP H/14 (34B) (LAION-2B) | Top-1 accuracy | 30.01 | | Unverified
1 | CLIP (ViT-B/32) | Average Score | 56.64 | | Unverified
1 | GLIP (Tiny A) | Average Score | 11.4 | | Unverified