
Zero-Shot Image Classification

Zero-shot image classification is a computer-vision technique in which a model classifies images into categories it did not see during training. This is achieved by leveraging semantic information about the categories, such as textual descriptions or relationships between classes.
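
In practice this is often done the CLIP way, which many of the papers listed below build on: embed the image and a text prompt for each candidate label in a shared space, then rank the labels by similarity. Below is a minimal sketch, assuming the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; the image path and label list are placeholders, not part of this page.

```python
# Minimal CLIP-style zero-shot classification sketch.
# Assumes: pip install transformers torch pillow
# "example.jpg" and the label list are placeholder assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate categories, phrased as natural-language prompts.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a platypus"]
image = Image.open("example.jpg")  # placeholder image path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity for each prompt;
# softmax turns it into a distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

Because the label set is supplied at inference time as plain text, the same checkpoint can be pointed at categories it was never explicitly trained on.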

Papers

Showing 1–50 of 111 papers

Title | Status | Hype
Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese | Code | 5
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | Code | 4
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models | Code | 4
PromptKD: Unsupervised Prompt Distillation for Vision-Language Models | Code | 3
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion | Code | 2
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling | Code | 2
PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration | Code | 2
Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP | Code | 2
WATT: Weight Average Test-Time Adaptation of CLIP | Code | 2
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction | Code | 2
RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | Code | 2
What does a platypus look like? Generating customized prompts for zero-shot image classification | Code | 2
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Code | 2
LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text | Code | 1
Post-hoc Probabilistic Vision-Language Models | Code | 1
TaxaBind: A Unified Embedding Space for Ecological Applications | Code | 1
Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge | Code | 1
Mind's Eye: Image Recognition by EEG via Multimodal Similarity-Keeping Contrastive Learning | Code | 1
Sparse Concept Bottleneck Models: Gumbel Tricks in Contrastive Learning | Code | 1
Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations | Code | 1
Can We Talk Models Into Seeing the World Differently? | Code | 1
PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts | Code | 1
PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization | Code | 1
Distilling Large Vision-Language Model with Out-of-Distribution Generalizability | Code | 1
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding | Code | 1
Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations | Code | 1
CamDiff: Camouflage Image Augmentation via Diffusion Model | Code | 1
Structure Pretraining and Prompt Tuning for Knowledge Graph Transfer | Code | 1
CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets | Code | 1
LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval | Code | 1
Reproducible scaling laws for contrastive language-image learning | Code | 1
General Image Descriptors for Open World Image Retrieval using ViT CLIP | Code | 1
Zero-Shot Temporal Action Detection via Vision-Language Prompting | Code | 1
DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning | Code | 1
Disentangled Ontology Embedding for Zero-shot Learning | Code | 1
Masked Unsupervised Self-training for Label-free Image Classification | Code | 1
CCMB: A Large-scale Chinese Cross-modal Benchmark | Code | 1
Zero-Shot Logit Adjustment | Code | 1
Exploring Hierarchical Graph Representation for Large-Scale Zero-Shot Image Classification | Code | 1
A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model | Code | 1
LiT: Zero-Shot Transfer with Locked-image text Tuning | Code | 1
FILIP: Fine-grained Interactive Language-Image Pre-Training | Code | 1
Benchmarking Knowledge-driven Zero-shot Learning | Code | 1
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation | Code | 1
Generative Multi-Label Zero-Shot Learning | Code | 1
CIBR: Cross-modal Information Bottleneck Regularization for Robust CLIP Generalization | - | 0
Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation | - | 0
Bayesian Test-Time Adaptation for Vision-Language Models | - | 0
MADS: Multi-Attribute Document Supervision for Zero-Shot Image Classification | - | 0
MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations | - | 0
Page 1 of 3

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | OpenCLIP H/14 (34B) (LAION-2B) | Top-1 accuracy | 30.01 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CLIP (ViT-B/32) | Average Score | 56.64 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GLIP (Tiny A) | Average Score | 11.4 | - | Unverified