| TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability | Nov 27, 2024 | Temporal Localization, Video Understanding | Code Available | 2 |
| GaussianSpeech: Audio-Driven Gaussian Avatars | Nov 27, 2024 | 3DGS | Code Available | 2 |
| Monocular Obstacle Avoidance Based on Inverse PPO for Fixed-wing UAVs | Nov 27, 2024 | Collision Avoidance, Deep Reinforcement Learning | Code Available | 2 |
| TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models | Nov 27, 2024 | Garment Reconstruction, Image Generation | Code Available | 2 |
| vesselFM: A Foundation Model for Universal 3D Blood Vessel Segmentation | Nov 26, 2024 | Image Segmentation, Medical Image Analysis | Code Available | 2 |
| MotionLLaMA: A Unified Framework for Motion Synthesis and Comprehension | Nov 26, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| Path-RAG: Knowledge-Guided Key Region Retrieval for Open-ended Pathology Visual Question Answering | Nov 26, 2024 | Prognosis, Question Answering | Code Available | 2 |
| Task Singular Vectors: Reducing Task Interference in Model Merging | Nov 26, 2024 | Classification, Image Classification | Code Available | 2 |
| Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis | Nov 26, 2024 | Denoising | Code Available | 2 |
| Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints | Nov 26, 2024 | Denoising, Image Generation | Code Available | 2 |
| Pretrained LLM Adapted with LoRA as a Decision Transformer for Offline RL in Quantitative Trading | Nov 26, 2024 | Offline RL, parameter-efficient fine-tuning | Code Available | 2 |
| HyperSeg: Towards Universal Visual Segmentation with Large Language Model | Nov 26, 2024 | Language Modeling, Large Language Model | Code Available | 2 |
| Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration | Nov 26, 2024 | 3D Reconstruction, Camera Calibration | Code Available | 2 |
| Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient | Nov 26, 2024 | GPU, Image Generation | Code Available | 2 |
| MWFormer: Multi-Weather Image Restoration Using Degradation-Aware Transformers | Nov 26, 2024 | Contrastive Learning, Image Restoration | Code Available | 2 |
| PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution | Nov 26, 2024 | Denoising, Image Super-Resolution | Code Available | 2 |
| Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment | Nov 26, 2024 | Image Quality Assessment, Question Answering | Code Available | 2 |
| OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection | Nov 26, 2024 | 3D Object Detection, Autonomous Driving | Code Available | 2 |
| DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting | Nov 26, 2024 | Attribute, Diversity | Code Available | 2 |
| TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba | Nov 26, 2024 | image-classification, Image Classification | Code Available | 2 |
| Monocular Lane Detection Based on Deep Learning: A Survey | Nov 25, 2024 | 3D Lane Detection, Autonomous Driving | Code Available | 2 |
| Scaling Spike-driven Transformer with Efficient Spike Firing Approximation Training | Nov 25, 2024 | object-detection, Object Detection | Code Available | 2 |
| UltraSam: A Foundation Model for Ultrasound using Large Open-Access Segmentation Datasets | Nov 25, 2024 | Segmentation | Code Available | 2 |
| Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering | Nov 25, 2024 | Question Answering, Visual Question Answering | Code Available | 2 |
| Interpreting Object-level Foundation Models via Visual Precision Search | Nov 25, 2024 | Explainable Artificial Intelligence (XAI), Object | Code Available | 2 |
| Open Vocabulary Monocular 3D Object Detection | Nov 25, 2024 | 3D Object Detection, Monocular 3D Object Detection | Code Available | 2 |
| Probing the limitations of multimodal language models for chemistry and materials research | Nov 25, 2024 | Experimental Design, Spatial Reasoning | Code Available | 2 |
| Exploring Discrete Flow Matching for 3D De Novo Molecule Generation | Nov 25, 2024 | | Code Available | 2 |
| Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing | Nov 25, 2024 | Denoising, Video Generation | Code Available | 2 |
| Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency | Nov 25, 2024 | Quantization, Video Restoration | Code Available | 2 |
| Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing | Nov 25, 2024 | Privacy Preserving | Code Available | 2 |
| ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration | Nov 25, 2024 | AI Agent, Visual Question Answering | Code Available | 2 |
| UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing | Nov 25, 2024 | | Code Available | 2 |
| SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis | Nov 25, 2024 | 3D Generation, 3DGS | Code Available | 2 |
| An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models | Nov 25, 2024 | Denoising, Scene Understanding | Code Available | 2 |
| Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation | Nov 25, 2024 | Image to 3D | Code Available | 2 |
| Preference Optimization for Reasoning with Pseudo Feedback | Nov 25, 2024 | GSM8K, Math | Code Available | 2 |
| MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model | Nov 25, 2024 | Novel View Synthesis | Code Available | 2 |
| Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation | Nov 24, 2024 | Semantic Segmentation | Code Available | 2 |
| LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training | Nov 24, 2024 | Math, Mixture-of-Experts | Code Available | 2 |
| ResCLIP: Residual Attention for Training-free Dense Vision-language Inference | Nov 24, 2024 | Attribute, Semantic Segmentation | Code Available | 2 |
| Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens | Nov 23, 2024 | Hallucination | Code Available | 2 |
| Towards Satellite Image Road Graph Extraction: A Global-Scale Dataset and A Novel Method | Nov 23, 2024 | Autonomous Driving | Code Available | 2 |
| Large Language Model with Region-guided Referring and Grounding for CT Report Generation | Nov 23, 2024 | Computed Tomography (CT), Diagnostic | Code Available | 2 |
| AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation | Nov 23, 2024 | Data Augmentation, Diversity | Code Available | 2 |
| What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation | Nov 23, 2024 | Image Generation, Scene Generation | Code Available | 2 |
| Gotta Hear Them All: Sound Source Aware Vision to Audio Generation | Nov 23, 2024 | All, Audio Generation | Code Available | 2 |
| Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks | Nov 23, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| Multi-Reranker: Maximizing performance of retrieval-augmented generation in the FinanceRAG challenge | Nov 23, 2024 | RAG, Retrieval | Code Available | 2 |
| A Survey on LLM-as-a-Judge | Nov 23, 2024 | Models Alignment, Survey | Code Available | 2 |