| CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models | Feb 4, 2026 | | —Unverified | 1 |
| OmniRad: A Radiological Foundation Model for Multi-Task Medical Image Analysis | Feb 4, 2026 | | —Unverified | 1 |
| SynthVerse: A Large-Scale Diverse Synthetic Dataset for Point Tracking | Feb 4, 2026 | | —Unverified | 1 |
| Same or Not? Enhancing Visual Perception in Vision-Language Models | Feb 4, 2026 | | —Unverified | 1 |
| When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs | Feb 4, 2026 | | —Unverified | 1 |
| daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently | Feb 4, 2026 | | —Unverified | 1 |
| Evaluating and Steering Modality Preferences in Multimodal Large Language Model | Feb 4, 2026 | | —Unverified | 1 |
| SkeletonGaussian: Editable 4D Generation through Gaussian Skeletonization | Feb 4, 2026 | | —Unverified | 1 |
| LIVE: Long-horizon Interactive Video World Modeling | Feb 3, 2026 | | —Unverified | 1 |
| ObjEmbed: Towards Universal Multimodal Object Embeddings | Feb 3, 2026 | | —Unverified | 1 |
| Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling | Feb 3, 2026 | | —Unverified | 1 |
| TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents | Feb 3, 2026 | | —Unverified | 1 |
| Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch | Feb 3, 2026 | | —Unverified | 1 |
| PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models | Feb 3, 2026 | | —Unverified | 1 |
| VLS: Steering Pretrained Robot Policies via Vision-Language Models | Feb 3, 2026 | | —Unverified | 1 |
| DLLM-Searcher: Adapting Diffusion Large Language Model for Search Agents | Feb 3, 2026 | | —Unverified | 1 |
| Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL | Feb 3, 2026 | | —Unverified | 1 |
| Joint Estimation of Piano Dynamics and Metrical Structure with a Multi-task Multi-Scale Network | Feb 3, 2026 | | —Unverified | 1 |
| GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts | Feb 3, 2026 | | —Unverified | 1 |
| Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability | Feb 2, 2026 | | —Unverified | 1 |
| SWE-Exp: Experience-Driven Software Issue Resolution | Feb 2, 2026 | | —Unverified | 1 |
| LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation | Feb 2, 2026 | | —Unverified | 1 |
| Hierarchical Entity-centric Reinforcement Learning with Factored Subgoal Diffusion | Feb 2, 2026 | | —Unverified | 1 |
| How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing | Feb 2, 2026 | | —Unverified | 1 |
| Show, Don't Tell: Morphing Latent Reasoning into Image Generation | Feb 2, 2026 | | —Unverified | 1 |
| CUA-Skill: Develop Skills for Computer Using Agent | Feb 2, 2026 | | —Unverified | 1 |
| Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models | Feb 2, 2026 | | —Unverified | 1 |
| Glance and Focus Reinforcement for Pan-cancer Screening | Feb 2, 2026 | | —Unverified | 1 |
| WideSeek: Advancing Wide Research via Multi-Agent Scaling | Feb 2, 2026 | | —Unverified | 1 |
| FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning | Feb 2, 2026 | | —Unverified | 1 |
| From Directions to Regions: Decomposing Activations in Language Models via Local Geometry | Feb 2, 2026 | | —Unverified | 1 |
| CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding | Feb 2, 2026 | | —Unverified | 1 |
| LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents | Feb 1, 2026 | | —Unverified | 1 |
| Rethinking Selective Knowledge Distillation | Feb 1, 2026 | | —Unverified | 1 |
| HalluHard: A Hard Multi-Turn Hallucination Benchmark | Feb 1, 2026 | | —Unverified | 1 |
| Language-based Trial and Error Falls Behind in the Era of Experience | Jan 31, 2026 | | —Unverified | 1 |
| EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control | Jan 31, 2026 | | —Unverified | 1 |
| ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents | Jan 30, 2026 | | —Unverified | 1 |
| TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation | Jan 30, 2026 | | —Unverified | 1 |
| Segment Any Events with Language | Jan 30, 2026 | | —Unverified | 1 |
| Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry | Jan 30, 2026 | | —Unverified | 1 |
| DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset | Jan 30, 2026 | | —Unverified | 1 |
| TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers | Jan 30, 2026 | | —Unverified | 1 |
| Mano: Restriking Manifold Optimization for LLM Training | Jan 30, 2026 | | —Unverified | 1 |
| Which Heads Matter for Reasoning? RL-Guided KV Cache Compression | Jan 30, 2026 | | —Unverified | 1 |
| PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature | Jan 30, 2026 | | —Unverified | 1 |
| DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report | Jan 30, 2026 | | —Unverified | 1 |
| Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation | Jan 30, 2026 | | —Unverified | 1 |
| AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts | Jan 30, 2026 | | —Unverified | 1 |
| ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation | Jan 29, 2026 | | —Unverified | 1 |