SOTAVerified

Zero-Shot Image Classification

Zero-shot image classification is a technique in computer vision where a model can classify images into categories that were not present during training. This is achieved by leveraging semantic information about the categories, such as textual descriptions or relationships between classes.
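As a minimal sketch of the idea described above, the snippet below picks the class whose text-description embedding is most similar to an image embedding. The vectors are hand-made toy values, not outputs of any real model; in practice they would come from the image and text encoders of a vision-language model. The function names, the prompt strings, and the temperature value of 0.07 are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over temperature-scaled scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, class_embeddings, temperature=0.07):
    """Return the best-matching class description and per-class probabilities.

    class_embeddings maps a textual class description to its embedding.
    No class-specific training happens here: new classes are added simply
    by adding new descriptions and their embeddings.
    """
    names = list(class_embeddings)
    sims = [cosine(image_embedding, class_embeddings[n]) for n in names]
    probs = softmax(sims, temperature)
    best = max(range(len(names)), key=probs.__getitem__)
    return names[best], dict(zip(names, probs))

# Toy embeddings (assumed values, for illustration only).
class_embeddings = {
    "a photo of a zebra": [0.9, 0.1, 0.0],
    "a photo of a horse": [0.6, 0.6, 0.1],
    "a photo of a bicycle": [0.0, 0.1, 0.9],
}
image_embedding = [0.85, 0.2, 0.05]

label, probs = zero_shot_classify(image_embedding, class_embeddings)
print(label)
```

Because the classifier is defined entirely by the class descriptions, swapping in a different label set requires no retraining, only new text embeddings.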

Papers

Showing 26-50 of 111 papers

| Title | Status | Hype |
|---|---|---|
| Zero-Shot Logit Adjustment | Code | 1 |
| General Image Descriptors for Open World Image Retrieval using ViT CLIP | Code | 1 |
| Distilling Large Vision-Language Model with Out-of-Distribution Generalizability | Code | 1 |
| Sparse Concept Bottleneck Models: Gumbel Tricks in Contrastive Learning | Code | 1 |
| Benchmarking Knowledge-driven Zero-shot Learning | Code | 1 |
| LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text | Code | 1 |
| Exploring Hierarchical Graph Representation for Large-Scale Zero-Shot Image Classification | Code | 1 |
| LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval | Code | 1 |
| Masked Unsupervised Self-training for Label-free Image Classification | Code | 1 |
| Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge | Code | 1 |
| Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations | Code | 1 |
| FILIP: Fine-grained Interactive Language-Image Pre-Training | Code | 1 |
| LiT: Zero-Shot Transfer with Locked-image text Tuning | Code | 1 |
| DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning | Code | 1 |
| Can We Talk Models Into Seeing the World Differently? | Code | 1 |
| A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model | Code | 1 |
| Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations | Code | 1 |
| Mind's Eye: Image Recognition by EEG via Multimodal Similarity-Keeping Contrastive Learning | Code | 1 |
| Structure Pretraining and Prompt Tuning for Knowledge Graph Transfer | Code | 1 |
| PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization | Code | 1 |
| A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene | | 0 |
| MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations | | 0 |
| DiRaC-I: Identifying Diverse and Rare Training Classes for Zero-Shot Learning | | 0 |
| CSA: Data-efficient Mapping of Unimodal Features to Multimodal Features | | 0 |
| Bridge the Modality and Capability Gaps in Vision-Language Model Selection | | 0 |

Benchmark Results

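| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | OpenClip H/14 (34B)(Laion2B) | Top-1 accuracy | 30.01 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | CLIP (ViT B-32) | Average Score | 56.64 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GLIP (Tiny A) | Average Score | 11.4 | | Unverified |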