| Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction | Apr 3, 2024 | Image Generation, Image Reconstruction | Code Available | 9 | 5 |
| Visually Descriptive Language Model for Vector Graphics Reasoning | Apr 9, 2024 | Descriptive, Language Modeling | Code Available | 9 | 5 |
| Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis | May 14, 2025 | Denoising, Depth Estimation | Code Available | 7 | 5 |
| FoundationStereo: Zero-Shot Stereo Matching | Jan 17, 2025 | Depth Estimation, Diversity | Code Available | 7 | 5 |
| Large Concept Models: Language Modeling in a Sentence Representation Space | Dec 11, 2024 | Language Modeling, Language Modelling | Code Available | 7 | 5 |
| Metric3Dv2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation | Mar 22, 2024 | Depth Estimation, Surface Normal Estimation | Code Available | 7 | 5 |
| ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs | Jul 31, 2023 | Trajectory Planning, Zero-shot Generalization | Code Available | 5 | 5 |
| Segment Anything for Videos: A Systematic Survey | Jul 31, 2024 | Image Segmentation, Robot Manipulation Generalization | Code Available | 5 | 5 |
| ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth | Feb 23, 2023 | Depth Estimation, Monocular Depth Estimation | Code Available | 5 | 5 |
| RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation | Oct 10, 2024 | Zero-shot Generalization | Code Available | 5 | 5 |
| Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement | Mar 9, 2025 | Domain Generalization, Object Detection | Code Available | 4 | 5 |
| Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image | Jul 20, 2023 | Depth Estimation, Image Reconstruction | Code Available | 4 | 5 |
| Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction | Sep 26, 2024 | 3D Reconstruction, Denoising | Code Available | 4 | 5 |
| Zero-1-to-3: Zero-shot One Image to 3D Object | Mar 20, 2023 | 3D Reconstruction, Image to 3D | Code Available | 4 | 5 |
| Zero-Shot Whole-Body Humanoid Control via Behavioral Foundation Models | Apr 15, 2025 | Humanoid Control, Reinforcement Learning (RL) | Code Available | 4 | 5 |
| Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation | Dec 4, 2023 | Depth Estimation, GPU | Code Available | 4 | 5 |
| Parameter-Efficient Prompt Tuning Makes Generalized and Calibrated Neural Text Retrievers | Jul 14, 2022 | Retrieval, Text Retrieval | Code Available | 4 | 5 |
| MonSter: Marry Monodepth to Stereo Unleashes Power | Jan 15, 2025 | Depth Estimation, Monocular Depth Estimation | Code Available | 4 | 5 |
| Expanding Language-Image Pretrained Models for General Video Recognition | Aug 4, 2022 | Action Classification, Action Recognition | Code Available | 3 | 5 |
| DEFOM-Stereo: Depth Foundation Model Based Stereo Matching | Jan 16, 2025 | Depth Estimation, Disparity Estimation | Code Available | 3 | 5 |
| Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting | Oct 12, 2023 | Decoder, Probabilistic Time Series Forecasting | Code Available | 3 | 5 |
| Detect Anything 3D in the Wild | Apr 10, 2025 | 3D Object Detection, Autonomous Driving | Code Available | 3 | 5 |
| ZIM: Zero-Shot Image Matting for Anything | Nov 1, 2024 | Image Inpainting, Image Matting | Code Available | 3 | 5 |
| 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations | Feb 18, 2024 | Denoising, Robot Manipulation | Code Available | 3 | 5 |
| Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail | Dec 5, 2024 | Stereo Matching, Zero-shot Generalization | Code Available | 3 | 5 |
| What Language Model to Train if You Have One Million GPU Hours? | Oct 27, 2022 | GPU, Language Modeling | Code Available | 3 | 5 |
| CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up | Dec 20, 2024 | 8k, GPU | Code Available | 3 | 5 |
| RobustSAM: Segment Anything Robustly on Degraded Images | Jun 13, 2024 | Deblurring, Image Dehazing | Code Available | 3 | 5 |
| Separate Anything You Describe | Aug 9, 2023 | Audio Source Separation, Natural Language Queries | Code Available | 3 | 5 |
| MVMoE: Multi-Task Vehicle Routing Solver with Mixture-of-Experts | May 2, 2024 | Combinatorial Optimization, Mixture-of-Experts | Code Available | 3 | 5 |
| General Object Foundation Model for Images and Videos at Scale | Dec 14, 2023 | Instance Segmentation, Long-tail Video Object Segmentation | Code Available | 3 | 5 |
| PE3R: Perception-Efficient 3D Reconstruction | Mar 10, 2025 | 3D Reconstruction, Zero-shot Generalization | Code Available | 3 | 5 |
| Objaverse-XL: A Universe of 10M+ 3D Objects | Jul 11, 2023 | Diversity, Novel View Synthesis | Code Available | 3 | 5 |
| Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera | Jan 5, 2025 | Data Augmentation, Depth Estimation | Code Available | 3 | 5 |
| IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus | Feb 22, 2024 | Zero-shot Generalization | Code Available | 3 | 5 |
| SMART: Scalable Multi-agent Real-time Motion Generation via Next-token Prediction | May 24, 2024 | Autonomous Driving, Motion Generation | Code Available | 3 | 5 |
| NeRF-Supervised Deep Stereo | Mar 30, 2023 | NeRF, Neural Rendering | Code Available | 2 | 5 |
| Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization | May 21, 2025 | Vision-Language-Action, Zero-shot Generalization | Code Available | 2 | 5 |
| Multitask Prompted Training Enables Zero-Shot Task Generalization | Oct 15, 2021 | Benchmarking, Decoder | Code Available | 2 | 5 |
| Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model | Mar 8, 2025 | Image Quality Assessment, Language Modeling | Code Available | 2 | 5 |
| Crosslingual Generalization through Multitask Finetuning | Nov 3, 2022 | Coreference Resolution, Cross-Lingual Transfer | Code Available | 2 | 5 |
| Delineate Anything: Resolution-Agnostic Field Boundary Delineation on Satellite Imagery | Apr 3, 2025 | Field Boundary Delineation, Instance Segmentation | Code Available | 2 | 5 |
| Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter | Mar 12, 2025 | Zero-shot Generalization | Code Available | 2 | 5 |
| Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning | Dec 17, 2024 | Denoising | Code Available | 2 | 5 |
| Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement | Oct 15, 2024 | Disentanglement, Inductive Bias | Code Available | 2 | 5 |
| No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance | Apr 4, 2024 | Benchmarking, Image Generation | Code Available | 2 | 5 |
| LLM+P: Empowering Large Language Models with Optimal Planning Proficiency | Apr 22, 2023 | Zero-shot Generalization | Code Available | 2 | 5 |
| BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing | Jun 30, 2022 | Diversity, Language Model Evaluation | Code Available | 2 | 5 |
| BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities | Oct 18, 2024 | Conditional Image Generation, Image Generation | Code Available | 2 | 5 |
| Learning to Route Among Specialized Experts for Zero-Shot Generalization | Feb 8, 2024 | parameter-efficient fine-tuning, Zero-shot Generalization | Code Available | 2 | 5 |