| In Defense of Online Models for Video Instance Segmentation | Jul 21, 2022 | Contrastive Learning, Instance Segmentation | Code Available | 2 |
| More Agents Is All You Need | Feb 3, 2024 | All | Code Available | 2 |
| You Only Look at Screens: Multimodal Chain-of-Action Agents | Sep 20, 2023 | Type prediction | Code Available | 2 |
| NoLiMa: Long-Context Evaluation Beyond Literal Matching | Feb 7, 2025 | | Code Available | 2 |
| MasRouter: Learning to Route LLMs for Multi-Agent Systems | Feb 16, 2025 | HumanEval, mbpp | Code Available | 2 |
| LightningDrag: Lightning Fast and Accurate Drag-based Image Editing Emerging from Videos | May 22, 2024 | | Code Available | 2 |
| Are We There Yet? A Brief Survey of Music Emotion Prediction Datasets, Models and Outstanding Challenges | Jun 13, 2024 | Emotion Recognition, Music Emotion Recognition | Code Available | 2 |
| STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics | Jun 10, 2024 | | Code Available | 2 |
| What the DAAM: Interpreting Stable Diffusion Using Cross Attention | Oct 10, 2022 | Denoising, Descriptive | Code Available | 2 |
| Video Diffusion Models: A Survey | May 6, 2024 | Survey, Text-to-Video Generation | Code Available | 2 |
| LKM-UNet: Large Kernel Vision Mamba UNet for Medical Image Segmentation | Mar 12, 2024 | Image Segmentation, Long-range modeling | Code Available | 2 |
| Distill Visual Chart Reasoning Ability from LLMs to MLLMs | Oct 24, 2024 | Multimodal Reasoning, Visual Reasoning | Code Available | 2 |
| UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition | Nov 21, 2022 | Contrastive Learning, Emotion Recognition | Code Available | 2 |
| Flow Matching for Medical Image Synthesis: Bridging the Gap Between Speed and Quality | Mar 1, 2025 | Image Enhancement, Image Generation | Code Available | 2 |
| Transformer-VQ: Linear-Time Transformers via Vector Quantization | Sep 28, 2023 | 8k, Decoder | Code Available | 2 |
| CoLLiE: Collaborative Training of Large Language Models in an Efficient Way | Dec 1, 2023 | GPU, parameter-efficient fine-tuning | Code Available | 2 |
| BAMM: Bidirectional Autoregressive Motion Model | Mar 28, 2024 | Denoising, model | Code Available | 2 |
| Vision Language Action Models in Robotic Manipulation: A Systematic Review | Jul 14, 2025 | Dataset Generation, Natural Language Understanding | Code Available | 2 |
| AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion Model | Dec 10, 2023 | Image Generation | Code Available | 2 |
| Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering | Nov 18, 2024 | | Code Available | 2 |
| The AdEMAMix Optimizer: Better, Faster, Older | Sep 5, 2024 | image-classification, Image Classification | Code Available | 2 |
| CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion | Nov 26, 2022 | object-detection, Object Detection | Code Available | 2 |
| TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data | Jul 10, 2024 | Contrastive Learning, multimodal interaction | Code Available | 2 |
| Automated Self-Supervised Learning for Recommendation | Mar 14, 2023 | Collaborative Filtering, Contrastive Learning | Code Available | 2 |
| BAT: Benchmark for Auto-bidding Task | May 13, 2025 | | Code Available | 2 |
| ONCE-3DLanes: Building Monocular 3D Lane Detection | Apr 30, 2022 | 3D Lane Detection, Autonomous Driving | Code Available | 2 |
| Task Me Anything | Jun 17, 2024 | 2k, Attribute | Code Available | 2 |
| Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model | May 15, 2024 | GPU, Language Modeling | Code Available | 2 |
| Generative Enhancement for 3D Medical Images | Mar 19, 2024 | counterfactual, Image Generation | Code Available | 2 |
| SAM-Med3D: Towards General-purpose Segmentation Models for Volumetric Medical Images | Oct 23, 2023 | 3D Architecture, Image Segmentation | Code Available | 2 |
| Less is More: Efficient Black-box Attribution via Minimal Interpretable Subset Selection | Apr 1, 2025 | | Code Available | 2 |
| Multi-modal Situated Reasoning in 3D Scenes | Sep 4, 2024 | 3D Question Answering (3D-QA) | Code Available | 2 |
| Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion | Jul 28, 2022 | Object, object-detection | Code Available | 2 |
| SeerAttention-R: Sparse Attention Adaptation for Long Reasoning | Jun 10, 2025 | 4k, GPU | Code Available | 2 |
| PsyDraw: A Multi-Agent Multimodal System for Mental Health Screening in Left-Behind Children | Dec 19, 2024 | | Code Available | 2 |
| Towards Practical Second-Order Optimizers in Deep Learning: Insights from Fisher Information Analysis | Apr 26, 2025 | Computational Efficiency, image-classification | Code Available | 2 |
| Towards 3D Molecule-Text Interpretation in Language Models | Jan 25, 2024 | Instruction Following, Language Modeling | Code Available | 2 |
| Sketch Video Synthesis | Nov 26, 2023 | Video Editing | Code Available | 2 |
| ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering | May 29, 2025 | Large Language Model, Prompt Engineering | Code Available | 2 |
| Faceptor: A Generalist Model for Face Perception | Mar 14, 2024 | Age Estimation, Attribute | Code Available | 2 |
| Video Object Segmentation in Panoptic Wild Scenes | May 8, 2023 | Object, Semantic Segmentation | Code Available | 2 |
| Where do Large Vision-Language Models Look at when Answering Questions? | Mar 18, 2025 | Question Answering, Visual Question Answering | Code Available | 2 |
| Video-Based Human Pose Regression via Decoupled Space-Time Aggregation | Mar 29, 2024 | Pose Estimation, regression | Code Available | 2 |
| Prototypical Information Bottlenecking and Disentangling for Multimodal Cancer Survival Prediction | Jan 3, 2024 | Disentanglement, Survival Prediction | Code Available | 2 |
| Tackling View-Dependent Semantics in 3D Language Gaussian Splatting | May 30, 2025 | 3D Scene Reconstruction, Scene Understanding | Code Available | 2 |
| HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context | Jun 26, 2025 | Large Language Model, Multimodal Reasoning | Code Available | 2 |
| MMToM-QA: Multimodal Theory of Mind Question Answering | Jan 16, 2024 | Question Answering, Theory of Mind Modeling | Code Available | 2 |
| Auto-PyTorch Tabular: Multi-Fidelity MetaLearning for Efficient and Robust AutoDL | Jun 24, 2020 | AutoML, Neural Architecture Search | Code Available | 2 |
| FlowDB a large scale precipitation, river, and flash flood dataset | Dec 21, 2020 | Multivariate Time Series Forecasting | Code Available | 2 |
| Wavelet Diffusion Models are fast and scalable Image Generators | Nov 29, 2022 | Blocking, Image Generation | Code Available | 2 |