SOTAVerified

Descriptive

Papers

Showing 1-50 of 1477 papers

Title | Status | Hype
Visually Descriptive Language Model for Vector Graphics Reasoning | Code | 9
T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy | Code | 7
AudioGen: Textually Guided Audio Generation | Code | 6
Fundamental Components of Deep Learning: A category-theoretic approach | Code | 5
Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey | Code | 3
Fine-Tuning Language Models from Human Preferences | Code | 3
Tokenization, Fusion, and Augmentation: Towards Fine-grained Multi-modal Entity Representation | Code | 3
Descriptive Image Quality Assessment in the Wild | Code | 3
A Survey on Self-Supervised Learning for Non-Sequential Tabular Data | Code | 3
Ultra-High-Resolution Image Synthesis: Data, Method and Evaluation | Code | 3
ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation | Code | 3
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision | Code | 2
Zero-Shot ECG Classification with Multimodal Learning and Test-time Clinical Knowledge Enhancement | Code | 2
What the DAAM: Interpreting Stable Diffusion Using Cross Attention | Code | 2
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning | Code | 2
Teaching LMMs for Image Quality Scoring and Interpreting | Code | 2
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language | Code | 2
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation | Code | 2
RuleKit 2: Faster and simpler rule learning | Code | 2
SensorLLM: Human-Intuitive Alignment of Multivariate Sensor Data with LLMs for Activity Recognition | Code | 2
Q-Insight: Understanding Image Quality via Visual Reinforcement Learning | Code | 2
SCAMPS: Synthetics for Camera Measurement of Physiological Signals | Code | 2
SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning | Code | 2
TeCH: Text-guided Reconstruction of Lifelike Clothed Humans | Code | 2
MedCalc-Bench: Evaluating Large Language Models for Medical Calculations | Code | 2
PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning | Code | 2
VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning | Code | 2
Customization Assistant for Text-to-image Generation | Code | 2
ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification | Code | 2
What does a platypus look like? Generating customized prompts for zero-shot image classification | Code | 2
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion | Code | 2
GRiT: A Generative Region-to-text Transformer for Object Understanding | Code | 2
FlashSloth : Lightning Multimodal Large Language Models via Embedded Visual Compression | Code | 2
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression | Code | 2
K-LITE: Learning Transferable Visual Models with External Knowledge | Code | 2
DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification | Code | 2
AmadeusGPT: a natural language interface for interactive animal behavioral analysis | Code | 2
Fine-grained Image Captioning with CLIP Reward | Code | 2
Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models | Code | 2
An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control | Code | 2
Language-driven Semantic Segmentation | Code | 2
ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model | Code | 2
CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models | Code | 2
RS-Agent: Automating Remote Sensing Tasks through Intelligent Agent | Code | 2
Composed Image Retrieval for Remote Sensing | Code | 2
Scalable 3D Captioning with Pretrained Models | Code | 2
Solving Data Quality Problems with Desbordante: a Demo | Code | 2
SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description | Code | 2
Deep Graph Matching under Quadratic Constraint | Code | 1
Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability | Code | 1
