| VMBench: A Benchmark for Perception-Aligned Video Motion Generation | Mar 13, 2025 | Motion Generation, Video Generation | Code Available | 2 |
| 3D Student Splatting and Scooping | Mar 13, 2025 | 3DGS, Neural Rendering | Code Available | 2 |
| OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer | Mar 13, 2025 | Decoder, multimodal interaction | Code Available | 2 |
| GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding | Mar 13, 2025 | Diversity, Language Modeling | Code Available | 2 |
| EEdit: Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing | Mar 13, 2025 | | Code Available | 2 |
| Multi-Modal Mamba Modeling for Survival Prediction (M4Survive): Adapting Joint Foundation Model Representations | Mar 13, 2025 | Computational Efficiency, Mamba | Code Available | 2 |
| ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness | Mar 13, 2025 | 3D Human Pose Estimation, 3D Human Shape Estimation | Code Available | 2 |
| 4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models | Mar 13, 2025 | Large Language Model, Object | Code Available | 2 |
| RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing | Mar 13, 2025 | Computational Efficiency, Mamba | Code Available | 2 |
| A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1 | Mar 13, 2025 | | Code Available | 2 |
| Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection | Mar 13, 2025 | Anomaly Detection, zero-shot anomaly detection | Code Available | 2 |
| DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding | Mar 13, 2025 | 4k, Autonomous Driving | Code Available | 2 |
| OR-LLM-Agent: Automating Modeling and Solving of Operations Research Optimization Problem with Reasoning Large Language Model | Mar 13, 2025 | AI Agent, Language Modeling | Code Available | 2 |
| RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors | Mar 13, 2025 | 3DGS | Code Available | 2 |
| Autoregressive Image Generation with Randomized Parallel Decoding | Mar 13, 2025 | Conditional Image Generation, Image Generation | Code Available | 2 |
| Exploring the best way for UAV visual localization under Low-altitude Multi-view Observation Condition: a Benchmark | Mar 12, 2025 | Image Retrieval, Retrieval | Code Available | 2 |
| SwapAnyone: Consistent and Realistic Video Synthesis for Swapping Any Person into Any Video | Mar 12, 2025 | Video Inpainting | Code Available | 2 |
| Neighboring Autoregressive Modeling for Efficient Visual Generation | Mar 12, 2025 | Image Generation, Text to Image Generation | Code Available | 2 |
| Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space | Mar 12, 2025 | Image-to-Image Translation, Video Editing | Code Available | 2 |
| PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop | Mar 12, 2025 | Diagnostic, Video Generation | Code Available | 2 |
| Teaching LMMs for Image Quality Scoring and Interpreting | Mar 12, 2025 | Descriptive, Image Quality Assessment | Code Available | 2 |
| ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning | Mar 12, 2025 | Multi-agent Reinforcement Learning, reinforcement-learning | Code Available | 2 |
| Manify: A Python Library for Learning Non-Euclidean Representations | Mar 12, 2025 | Representation Learning | Code Available | 2 |
| Foundation Models for Spatio-Temporal Data Science: A Tutorial and Survey | Mar 12, 2025 | Management | Code Available | 2 |
| Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter | Mar 12, 2025 | Zero-shot Generalization | Code Available | 2 |
| KNighter: Transforming Static Analysis with LLM-Synthesized Checkers | Mar 12, 2025 | | Code Available | 2 |
| CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games | Mar 12, 2025 | Decision Making, Vision-Language-Action | Code Available | 2 |
| OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models | Mar 11, 2025 | GPU, Mamba | Code Available | 2 |
| External Knowledge Injection for CLIP-Based Class-Incremental Learning | Mar 11, 2025 | class-incremental learning, Class Incremental Learning | Code Available | 2 |
| Mellow: a small audio language model for reasoning | Mar 11, 2025 | Audio captioning, Language Modeling | Code Available | 2 |
| TrackOcc: Camera-based 4D Panoptic Occupancy Tracking | Mar 11, 2025 | 3D Object Tracking, Object Tracking | Code Available | 2 |
| MMRL: Multi-Modal Representation Learning for Vision-Language Models | Mar 11, 2025 | Prompt Engineering, Representation Learning | Code Available | 2 |
| Referring to Any Person | Mar 11, 2025 | Large Language Model, Multimodal Large Language Model | Code Available | 2 |
| QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension | Mar 11, 2025 | AutoML, Decoder | Code Available | 2 |
| SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Mar 11, 2025 | Decision Making, Interactive Segmentation | Code Available | 2 |
| LongProLIP: A Probabilistic Vision-Language Model with Long Context Text | Mar 11, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| "Principal Components" Enable A New Language of Images | Mar 11, 2025 | Decoder | Code Available | 2 |
| A Neural Symbolic Model for Space Physics | Mar 11, 2025 | Large Language Model, model | Code Available | 2 |
| GigaSLAM: Large-Scale Monocular SLAM with Hierarchical Gaussian Splats | Mar 11, 2025 | 3DGS, NeRF | Code Available | 2 |
| LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization | Mar 11, 2025 | GPU, Image Generation | Code Available | 2 |
| V-Max: A Reinforcement Learning Framework for Autonomous Driving | Mar 11, 2025 | Autonomous Driving, Decision Making | Code Available | 2 |
| HiP-AD: Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder | Mar 11, 2025 | Autonomous Driving, Bench2Drive | Code Available | 2 |
| Parametric Point Cloud Completion for Polygonal Surface Reconstruction | Mar 11, 2025 | Point Cloud Completion, Surface Reconstruction | Code Available | 2 |
| YOLOMG: Vision-based Drone-to-Drone Detection with Appearance and Pixel-Level Motion Fusion | Mar 10, 2025 | | Code Available | 2 |
| When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning | Mar 10, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning | Mar 10, 2025 | | Code Available | 2 |
| Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model | Mar 10, 2025 | Image Description, Image Generation | Code Available | 2 |
| DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs | Mar 10, 2025 | Code Generation, Instruction Following | Code Available | 2 |
| AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion | Mar 10, 2025 | Video Generation | Code Available | 2 |