| A Survey on Efficient Vision-Language-Action Models | Feb 2, 2026 | | Unverified | 2 |
| End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning | Feb 1, 2026 | | Unverified | 2 |
| X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests | Feb 1, 2026 | | Unverified | 2 |
| On the Design of One-step Diffusion via Shortcutting Flow Paths | Feb 1, 2026 | | Unverified | 2 |
| Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling | Jan 31, 2026 | | Unverified | 2 |
| Residual Context Diffusion Language Models | Jan 30, 2026 | | Unverified | 2 |
| Shaping capabilities with token-level data filtering | Jan 30, 2026 | | Unverified | 2 |
| Exploring Reasoning Reward Model for Agents | Jan 29, 2026 | | Unverified | 2 |
| Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving | Jan 29, 2026 | | Unverified | 2 |
| Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models | Jan 29, 2026 | | Unverified | 2 |
| DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation | Jan 29, 2026 | | Unverified | 2 |
| Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models | Jan 29, 2026 | | Unverified | 2 |
| Efficient Autoregressive Video Diffusion with Dummy Head | Jan 28, 2026 | | Unverified | 2 |
| WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models | Jan 28, 2026 | | Unverified | 2 |
| AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning | Jan 28, 2026 | | Unverified | 2 |
| Scaling Behavior Cloning Improves Causal Reasoning: An Open Model for Real-Time Video Game Playing | Jan 28, 2026 | | Unverified | 2 |
| Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding | Jan 28, 2026 | | Unverified | 2 |
| Innovator-VL: A Multimodal Large Language Model for Scientific Discovery | Jan 27, 2026 | | Unverified | 2 |
| Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision | Jan 27, 2026 | | Unverified | 2 |
| Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models | Jan 27, 2026 | | Unverified | 2 |
| daVinci-Dev: Agent-native Mid-training for Software Engineering | Jan 27, 2026 | | Unverified | 2 |
| Towards Pixel-Level VLM Perception via Simple Points Prediction | Jan 27, 2026 | | Unverified | 2 |
| Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs | Jan 27, 2026 | | Unverified | 2 |
| Unleashing Scientific Reasoning for Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism | Jan 27, 2026 | | Unverified | 2 |
| DeFM: Learning Foundation Representations from Depth for Robotics | Jan 26, 2026 | | Unverified | 2 |
| Self-Refining Video Sampling | Jan 26, 2026 | | Unverified | 2 |
| HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding | Jan 26, 2026 | | Unverified | 2 |
| Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control | Jan 26, 2026 | | Unverified | 2 |
| SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning | Jan 25, 2026 | | Unverified | 2 |
| BuildArena: A Physics-Aligned Interactive Benchmark of LLMs for Engineering Construction | Jan 24, 2026 | | Unverified | 2 |
| Q-learning with Adjoint Matching | Jan 23, 2026 | | Unverified | 2 |
| The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding | Jan 23, 2026 | | Unverified | 2 |
| Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model | Jan 23, 2026 | | Unverified | 2 |
| VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents | Jan 23, 2026 | | Unverified | 2 |
| Boosting Generative Image Modeling via Joint Image-Feature Synthesis | Jan 22, 2026 | | Unverified | 2 |
| GutenOCR: A Grounded Vision-Language Front-End for Documents | Jan 22, 2026 | | Unverified | 2 |
| SciArena: An Open Evaluation Platform for Non-Verifiable Scientific Literature-Grounded Tasks | Jan 22, 2026 | | Unverified | 2 |
| BPMN Assistant: An LLM-Based Approach to Business Process Modeling | Jan 22, 2026 | | Unverified | 2 |
| Rethinking Video Generation Model for the Embodied World | Jan 21, 2026 | | Unverified | 2 |
| Adaptive Multi-Agent Reasoning via Automated Workflow Generation | Jul 18, 2025 | | Code Available | 2 |
| SystolicAttention: Fusing FlashAttention within a Single Systolic Array | Jul 15, 2025 | Scheduling | Code Available | 2 |
| CharaConsist: Fine-Grained Consistent Character Generation | Jul 15, 2025 | Consistent Character Generation, Image Generation | Code Available | 2 |
| Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation | Jul 15, 2025 | Image Segmentation, Segmentation | Code Available | 2 |
| Seq vs Seq: An Open Suite of Paired Encoders and Decoders | Jul 15, 2025 | Decoder, Large Language Model | Code Available | 2 |
| DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering | Jul 15, 2025 | Benchmarking, Instruction Following | Code Available | 2 |
| The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs | Jul 15, 2025 | Code Generation, Safety Alignment | Code Available | 2 |
| MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-group Quantization | Jul 14, 2025 | 2k, Image Generation | Code Available | 2 |
| Vision Language Action Models in Robotic Manipulation: A Systematic Review | Jul 14, 2025 | Dataset Generation, Natural Language Understanding | Code Available | 2 |
| I^2-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting | Jul 12, 2025 | Autonomous Driving, Computational Efficiency | Code Available | 2 |
| CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards | Jul 12, 2025 | | Code Available | 2 |