SOTAVerified

Descriptive

Papers

Showing 1-50 of 1477 papers

Title | Status | Hype
Visually Descriptive Language Model for Vector Graphics Reasoning | Code | 9
T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy | Code | 7
AudioGen: Textually Guided Audio Generation | Code | 6
Fundamental Components of Deep Learning: A category-theoretic approach | Code | 5
ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation | Code | 3
A Survey on Self-Supervised Learning for Non-Sequential Tabular Data | Code | 3
Descriptive Image Quality Assessment in the Wild | Code | 3
Tokenization, Fusion, and Augmentation: Towards Fine-grained Multi-modal Entity Representation | Code | 3
Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey | Code | 3
Ultra-High-Resolution Image Synthesis: Data, Method and Evaluation | Code | 3
Fine-Tuning Language Models from Human Preferences | Code | 3
ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification | Code | 2
Zero-Shot ECG Classification with Multimodal Learning and Test-time Clinical Knowledge Enhancement | Code | 2
What the DAAM: Interpreting Stable Diffusion Using Cross Attention | Code | 2
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning | Code | 2
Teaching LMMs for Image Quality Scoring and Interpreting | Code | 2
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language | Code | 2
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation | Code | 2
RuleKit 2: Faster and simpler rule learning | Code | 2
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion | Code | 2
An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control | Code | 2
SensorLLM: Human-Intuitive Alignment of Multivariate Sensor Data with LLMs for Activity Recognition | Code | 2
Customization Assistant for Text-to-image Generation | Code | 2
TeCH: Text-guided Reconstruction of Lifelike Clothed Humans | Code | 2
PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning | Code | 2
ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model | Code | 2
VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning | Code | 2
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision | Code | 2
Solving Data Quality Problems with Desbordante: a Demo | Code | 2
What does a platypus look like? Generating customized prompts for zero-shot image classification | Code | 2
Language-driven Semantic Segmentation | Code | 2
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression | Code | 2
MedCalc-Bench: Evaluating Large Language Models for Medical Calculations | Code | 2
DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification | Code | 2
Fine-grained Image Captioning with CLIP Reward | Code | 2
FlashSloth : Lightning Multimodal Large Language Models via Embedded Visual Compression | Code | 2
GRiT: A Generative Region-to-text Transformer for Object Understanding | Code | 2
K-LITE: Learning Transferable Visual Models with External Knowledge | Code | 2
CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models | Code | 2
Q-Insight: Understanding Image Quality via Visual Reinforcement Learning | Code | 2
Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models | Code | 2
AmadeusGPT: a natural language interface for interactive animal behavioral analysis | Code | 2
RS-Agent: Automating Remote Sensing Tasks through Intelligent Agent | Code | 2
Composed Image Retrieval for Remote Sensing | Code | 2
Scalable 3D Captioning with Pretrained Models | Code | 2
SCAMPS: Synthetics for Camera Measurement of Physiological Signals | Code | 2
SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning | Code | 2
SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description | Code | 2
Deep Graph Matching under Quadratic Constraint | Code | 1
Deep Implicit Statistical Shape Models for 3D Medical Image Delineation | Code | 1

No leaderboard results yet.