| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | May 7, 2024 | Large Language Model, Multimodal Large Language Model | —Unverified | 0 |
| VideoOrion: Tokenizing Object Dynamics in Videos | Nov 25, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| VideoPoet: A Large Language Model for Zero-Shot Video Generation | Dec 21, 2023 | Decoder, Language Modeling | —Unverified | 0 |
| Video Summarization with Large Language Models | Apr 15, 2025 | Large Language Model, Video Summarization | —Unverified | 0 |
| Video-VoT-R1: An efficient video inference model integrating image packing and AoE architecture | Mar 20, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |
| ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models | Apr 17, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| Vi-Mistral-X: Building a Vietnamese Language Model with Advanced Continual Pre-training | Mar 20, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| VinaLLaMA: LLaMA-based Vietnamese Foundation Model | Dec 18, 2023 | Language Modeling, Language Modelling | —Unverified | 0 |
| Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese | Aug 22, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| ViPer: Visual Personalization of Generative Models via Individual Preference Learning | Jul 24, 2024 | Image Generation, Language Modeling | —Unverified | 0 |
| VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models | Feb 14, 2025 | Image Captioning, Large Language Model | —Unverified | 0 |
| Vision and Intention Boost Large Language Model in Long-Term Action Anticipation | May 3, 2025 | Action Anticipation, In-Context Learning | —Unverified | 0 |
| Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning | Feb 19, 2025 | Common Sense Reasoning, Language Modeling | —Unverified | 0 |
| Vision-centric Token Compression in Large Language Model | Feb 2, 2025 | In-Context Learning, Language Modeling | —Unverified | 0 |
| VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework | Mar 14, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| Vision-Integrated LLMs for Autonomous Driving Assistance : Human Performance Comparison and Trust Evaluation | Feb 6, 2025 | Autonomous Driving, Decision Making | —Unverified | 0 |
| Vision-Language Models Represent Darker-Skinned Black Individuals as More Homogeneous than Lighter-Skinned Black Individuals | Dec 12, 2024 | Image Captioning, Image Generation | —Unverified | 0 |
| VisionLLM-based Multimodal Fusion Network for Glottic Carcinoma Early Detection | Dec 24, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| [Vision Paper] PRObot: Enhancing Patient-Reported Outcome Measures for Diabetic Retinopathy using Chatbots and Generative AI | Nov 5, 2024 | Chatbot, Language Modeling | —Unverified | 0 |
| VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions | Jul 17, 2024 | Autonomous Vehicles, Language Modeling | —Unverified | 0 |
| Visual Adversarial Attack on Vision-Language Models for Autonomous Driving | Nov 27, 2024 | Adversarial Attack, Autonomous Driving | —Unverified | 0 |
| Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning | Jun 3, 2022 | Image Paragraph Captioning, Language Modeling | —Unverified | 0 |
| Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval | Apr 23, 2024 | Image Retrieval, Language Modeling | —Unverified | 0 |
| Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation | May 23, 2024 | Audio Generation, Denoising | —Unverified | 0 |
| Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation | Apr 30, 2024 | Caption Generation, Hallucination | —Unverified | 0 |